# JEP-0013: Metrics, Tracing, and Log Observability
| Field | Value |
|---|---|
| JEP | 0013 |
| Title | Metrics, Tracing, and Log Observability |
| Author(s) | @mangelajo (Miguel Angel Ajo Pelayo, miguelangel@ajo.es) |
| Status | Accepted |
| Type | Standards Track |
| Created | 2026-04-23 |
| Updated | 2026-05-04 |
| Discussion | https://github.com/jumpstarter-dev/jumpstarter/pull/631 |
| Requires | — |
| Supersedes | — |
| Superseded-By | — |
## Abstract
This JEP defines an optional, cross-component observability model for Jumpstarter covering lease context metadata, structured operational events, exporter/driver metrics, and standardized logging. It targets direct integration with Prometheus (scrape), Loki (log aggregation), and Perses (dashboards) — without mandating OpenTelemetry — and introduces an optional in-cluster Jumpstarter Telemetry service that aggregates data from exporters and clients so that edge processes never need Loki or cluster-scrape credentials. Implementation is expected to land in phases; this JEP describes the end state and compatibility rules.
## Phases

| Phase | Scope | Key deliverables |
|---|---|---|
| 1 | Structured logging + lease context | Structured JSON logging (DD-4) and `spec.context` lease metadata (DD-1). |
| 2 | Metrics endpoints | `/metrics` endpoints for Prometheus scrape (DD-3). |
| 3 | Telemetry service | Optional. |
| 4 | Exporter drivers telemetry | Provides a clean architecture that lets drivers generate their own telemetry data. |
| 5 | In-cluster log scraping | Operator configures log shipper integration (Promtail, Grafana Alloy, Vector) for Controller/Router pod logs. |
| 6 | Dashboards + alerting | Perses CRD dashboards; starter alert rules; documentation and operator integration. |
Each phase is independently useful and builds on the previous ones. Phase 1 can ship without any later phase; operators who only need structured logs benefit immediately. Phase 2 adds scrape-ready metrics without requiring the Telemetry service.
## Motivation
Today, operators and CI maintainers need to answer questions that raw Kubernetes objects and ad hoc text logs do not always answer in one place:
- Which pipeline or image was being tested on this lease?
- How often do flashes fail on this exporter?
- What lease or user correlates a controller line with a failure on the client?
The Lease API already models scheduling and assignment; it does
not yet provide a first-class, documented place for run metadata or a standard
for lease-scoped operational events (beyond generic conditions).
Exporters expose work to drivers, but there is no shared model for driver- or exporter-level metrics that a monitoring stack can scrape or receive.
### User Stories

- As a lab operator, I want to see flash success/failure rates per exporter in a Prometheus dashboard, so that I can spot failing hardware before CI teams notice.
- As a CI pipeline author, I want to attach my build ID and image digest to a lease, so that post-mortem queries in Loki can filter all logs for one pipeline run across controller, exporter, and client.
- As a platform engineer, I want exporter processes to send telemetry without holding Loki or Prometheus credentials, so that I do not have to distribute and rotate secrets on every lab machine.
- As an AI agent orchestrating CI, I want machine-readable structured logs and metric exemplars with lease context, so that I can programmatically identify failing exporters and correlate test results without parsing free-form text.
## Proposal

### Concepts

- Lease context — identifiers and labels supplied by a client or CI and associated for the life of a lease, propagated where safe so metrics, logs, and traces can be filtered and joined.
- Lease events (or operations) — annotated, structured log entries recording significant actions (for example flash started, flash failed, image reference) with typed fields, queryable in Loki alongside regular logs and distinct from higher-frequency debug output (see DD-2).
- Exporter metrics — counters (operations, bytes), histograms (operation duration), and gauges (active sessions) exposed from the exporter and enriched by individual drivers via the `driver_type` label. Each driver selects a category from a predefined set in Jumpstarter core (e.g. `storage`, `power`, `network`, `serial`, `console`, `video`, `composite`). Composite drivers (e.g. Renode, QEMU) that bundle multiple sub-drivers do not emit a single top-level category for delegated work. Instead, each sub-driver emits its own `driver_type` when it performs an operation — a Renode storage sub-driver emits `driver_type="storage"`, its power sub-driver emits `driver_type="power"`, and so on. Any top-level methods on the composite driver itself (e.g. VM lifecycle) emit `driver_type="composite"`.
- Jumpstarter Telemetry (optional) — a dedicated component that reverse-scrapes connected exporters for metrics via `MetricsStream` and receives structured logs via `PushLogs`, using the same trust model (mTLS, ServiceAccount) as Controller/Router. It isolates Loki/series work from the reconciler hot path (see DD-7). Multi-replica HA with persistent exporter connections is covered in DD-8; best-effort log deduplication in DD-9.
### What users see

- When creating a lease, clients (or their tooling) can attach metadata via CRD fields and/or `spec.context` using documented keys and size limits. Example keys might include a build / pipeline identifier, image digest, or VCS revision.
- The controller and/or data plane write structured, annotated log events (see DD-2) for significant operations such as flash attempts and outcomes.
- Exporters maintain local `prometheus_client` counters and open a `MetricsStream` to the Jumpstarter Telemetry service over the existing exporter↔control-plane trust boundary. On each Prometheus scrape, the Telemetry service fans out to connected exporters and serves the merged `/metrics` output (see DD-3, DD-7) using in-cluster credentials — avoiding per-exporter Loki and metrics secrets. Exporters and clients also push structured log entries via `PushLogs` (not unbounded default chatter — see Control-plane aggregation below).
- The `jmp` CLI output remains human-readable, but when a Telemetry endpoint is available, `jmp` also pushes structured JSON logs to the Jumpstarter Telemetry service for Loki ingest.
## API / Protocol Changes

### CRD (Lease)

Additive changes only for the `spec.context` field. Backwards compatibility is preserved by keeping this field empty by default.
### gRPC: Telemetry endpoint discovery (`jumpstarter.proto`)
A new RPC on the existing ControllerService lets both exporters and
clients discover the optional Telemetry endpoint:
```proto
// Added to ControllerService
rpc GetServiceEndpoints(GetServiceEndpointsRequest)
    returns (GetServiceEndpointsResponse);

message GetServiceEndpointsRequest {}

message GetServiceEndpointsResponse {
  // Empty when telemetry is not enabled.
  repeated TelemetryEndpoint telemetry_endpoints = 1;
}

message TelemetryEndpoint {
  string endpoint = 1;     // gRPC address (host:port)
  string certificate = 2;  // Optional CA cert for the endpoint
  string min_severity = 3; // Minimum severity to forward (e.g. "warning")
}
```
Exporters call GetServiceEndpoints after Register; clients call it
after authentication. An empty telemetry_endpoints list means telemetry
is not deployed — callers skip all telemetry RPCs. Older controllers
that do not implement the method return UNIMPLEMENTED, which callers
treat identically to an empty list.
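As a hedged sketch, a Python caller might treat `UNIMPLEMENTED` and an empty list identically; the generated-stub module names below are assumptions, not the shipped package layout:

```python
import grpc

# Hypothetical generated-stub imports; real module paths depend on the build.
from jumpstarter.v1 import jumpstarter_pb2, jumpstarter_pb2_grpc


def discover_telemetry(channel: grpc.Channel) -> list:
    """Return TelemetryEndpoints, or [] when telemetry is not deployed."""
    stub = jumpstarter_pb2_grpc.ControllerServiceStub(channel)
    try:
        resp = stub.GetServiceEndpoints(jumpstarter_pb2.GetServiceEndpointsRequest())
    except grpc.RpcError as err:
        # Older controllers answer UNIMPLEMENTED; treat it as "not deployed".
        if err.code() == grpc.StatusCode.UNIMPLEMENTED:
            return []
        raise
    return list(resp.telemetry_endpoints)


# Callers skip all telemetry RPCs when the list is empty.
if not discover_telemetry(grpc.insecure_channel("controller.example:8082")):
    print("telemetry not deployed; skipping telemetry RPCs")
```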
### gRPC: Telemetry service (`telemetry.proto` — new file)
A new protocol/proto/jumpstarter/v1/telemetry.proto defines the
TelemetryService implemented by jumpstarter-telemetry. It has two
RPCs: one for metrics (reverse scrape) and one for log push.
#### Metrics: reverse scrape via `MetricsStream`
Exporters maintain a local prometheus_client.CollectorRegistry with
counters, histograms, and gauges. Rather than pushing increments, the
exporter opens a persistent bidirectional stream to the Telemetry
service; the Telemetry service periodically sends a scrape request
and the exporter responds with the output of
prometheus_client.generate_latest() in OpenMetrics text format.
```proto
service TelemetryService {
  // Persistent bidirectional stream: telemetry sends scrape requests,
  // exporter responds with full metric snapshots.
  rpc MetricsStream(stream MetricsStreamRequest)
      returns (stream MetricsStreamResponse);

  // Structured log / event push (used by both exporters and clients).
  rpc PushLogs(PushLogsRequest) returns (PushLogsResponse);
}

// Exporter → Telemetry
message MetricsStreamRequest {
  oneof msg {
    MetricsRegister register = 1;               // First message: identify this exporter
    MetricsScrapeResponse scrape_response = 2;  // Subsequent: reply to a scrape
  }
}

message MetricsRegister {
  string identity = 1; // Exporter CRD name (verified against mTLS and auth token by server)
}

message MetricsScrapeResponse {
  bytes metrics_text = 1; // generate_latest() OpenMetrics output
  google.protobuf.Timestamp timestamp = 2;
}

// Telemetry → Exporter
message MetricsStreamResponse {
  oneof msg {
    MetricsScrapeRequest scrape_request = 1;
  }
}

message MetricsScrapeRequest {} // "send your /metrics now"
```
The stream lifecycle:

1. The exporter opens the stream and sends `MetricsRegister`; the `jumpstarter-telemetry` service authenticates the exporter identity and labels from cluster information.
2. When Prometheus (or any scraper) hits the Telemetry service's `/metrics` endpoint, Telemetry fans out `MetricsScrapeRequest` to all connected exporters.
3. Each exporter calls `generate_latest(registry)` and replies with `MetricsScrapeResponse`.
4. Telemetry merges the responses and serves the combined result, adding and filtering any necessary labels or exemplars. This on-demand approach avoids stale data and unnecessary background traffic; it can be changed to periodic pre-fetching later if scrape latency becomes problematic.
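A minimal sketch of the exporter side of this lifecycle, using the real `prometheus_client` registry API (metric names follow the illustrative Proposed metrics table; wiring the snapshot bytes into a `MetricsScrapeResponse` is assumed):

```python
from prometheus_client import CollectorRegistry, Counter, Histogram
from prometheus_client.openmetrics.exposition import generate_latest

# Local registry: full counter state lives on the exporter at all times.
registry = CollectorRegistry()

OPERATIONS = Counter(
    "jumpstarter_operations",  # exposed as jumpstarter_operations_total
    "Total operations performed.",
    ["exporter", "operation", "result", "driver_type"],
    registry=registry,
)
DURATION = Histogram(
    "jumpstarter_operation_duration_seconds",
    "Duration of each operation.",
    ["exporter", "operation", "driver_type"],
    registry=registry,
)

# A driver instruments its work with ordinary counter/histogram calls.
OPERATIONS.labels("lab-01", "flash", "success", "storage").inc()
DURATION.labels("lab-01", "flash", "storage").observe(42.5)


def on_scrape_request() -> bytes:
    # Reply to a MetricsScrapeRequest with a full OpenMetrics snapshot;
    # this becomes the MetricsScrapeResponse.metrics_text payload.
    return generate_latest(registry)
```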
Client-side metrics are not collected. All metrically-interesting
operations are observable from the exporter side: DriverCall methods
run on the exporter and can be instrumented there. Client-side drivers
that orchestrate complex workflows (e.g. serial-console-driven
flashing) report outcomes back to the exporter via regular
DriverCall methods, keeping the exporter as the single source of
truth for metrics.
#### Logs: push via `PushLogs`
Both exporters and clients push structured log entries to the Telemetry service for Loki ingest:
```proto
message PushLogsRequest {
  repeated LogEntry entries = 1;
}

message PushLogsResponse {
  uint32 accepted = 1; // Entries accepted
  uint32 dropped = 2;  // Entries dropped (backpressure)
}

message LogEntry {
  google.protobuf.Timestamp timestamp = 1;
  string severity = 2;                   // debug, info, warning, error, critical
  string message = 3;
  string component = 4;                  // Log stream label: cli, exporter
  string exporter = 5;                   // Log stream label: exporter CRD name
  string lease_id = 6;                   // High-cardinality, log body only
  string client = 7;                     // High-cardinality, log body only
  string operation = 8;                  // flash, power, etc.
  string result = 9;                     // success, failure
  string driver_type = 10;               // storage, power, network, etc.
  map<string, string> extra_fields = 11; // Driver-specific structured data
}
```
The Telemetry service maps component and exporter to Loki stream
labels and everything else into the JSON body, following the
cardinality rules in Cardinality guidelines. The exporter and
client fields are verified server-side with the authenticated
identity to prevent impersonation. `spec.context` entries associated
with the active lease (e.g. `build_id`, `image_digest`) are placed in
`extra_fields` by the caller. Fields left empty, and details derivable
from `lease_id`, are filled in by the Telemetry service before the
entry is forwarded.
extra_fields limits: To prevent unbounded log payloads,
extra_fields is capped at 16 entries per LogEntry. Key names are
limited to 64 characters and values to 256 characters. The Telemetry
service enforces these limits server-side, silently truncating or
dropping entries that exceed them before forwarding to Loki.
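For illustration, server-side enforcement of these limits reduces to a few lines; the helper below is a hypothetical sketch, not part of the shipped service:

```python
MAX_ENTRIES, MAX_KEY, MAX_VALUE = 16, 64, 256


def clamp_extra_fields(extra_fields: dict) -> dict:
    """Enforce the LogEntry.extra_fields limits before forwarding to Loki."""
    clamped = {}
    for key, value in extra_fields.items():
        if len(clamped) >= MAX_ENTRIES:
            break                       # drop entries beyond the 16-entry cap
        clamped[key[:MAX_KEY]] = value[:MAX_VALUE]  # truncate long keys/values
    return clamped


# spec.context entries (e.g. build_id, image_digest) travel in extra_fields.
fields = clamp_extra_fields({"build_id": "nightly-42", "image_digest": "sha256:..."})
```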
String fields rationale: Fields such as severity, operation,
result, and driver_type are intentionally string rather than
protobuf enums. These values end up as JSON log body fields in Loki
where string representation is required regardless. Using strings
keeps the wire format forward-compatible (new categories or result
codes do not require a proto regeneration), and validation of allowed
values is enforced at the application layer using the operator’s
configuration (e.g. driverTypeEnum allowlist). The same reasoning
applies to extra_fields and structured_fields in
LogStreamResponse — they carry driver-specific key-value data
destined for log bodies, not typed metrics.
### gRPC: AuditStream removal (`jumpstarter.proto`)
The existing AuditStream RPC on ControllerService and its
AuditStreamRequest message are removed. Analysis of the codebase
shows this is dead code:
- The Go controller has no implementation — calls fall through to `UnimplementedControllerServiceServer`, which returns `codes.Unimplemented`.
- No Python code (exporter or client) calls the RPC.
- No tests exercise it beyond generated stubs.
Its intended purpose (tracking exporter activity) is fully superseded
by TelemetryService.PushLogs with a richer, properly-designed
message format.
### gRPC: LogStreamResponse enrichment (`jumpstarter.proto`)
The existing LogStream RPC on ExporterService is kept — it serves
a fundamentally different purpose (real-time session logs from
exporter to connected client) from the Telemetry log push. However,
the LogStreamResponse message is enriched with optional additive
fields to support richer client-side display and optional dual-path
forwarding to telemetry:
```proto
message LogStreamResponse {
  string uuid = 1;
  string severity = 2;
  string message = 3;
  optional LogSource source = 4;
  // New additive fields:
  optional string driver_type = 5; // Category when source=DRIVER
  optional string operation = 6;   // When the log is part of a known operation
  optional google.protobuf.Timestamp timestamp = 7;
  map<string, string> structured_fields = 8;
}
```
These fields are optional and backward compatible — older clients
ignore unknown fields; older exporters simply do not set them.
The same size limits as LogEntry.extra_fields apply to
structured_fields (16 entries, 64-char keys, 256-char values).
### Tracing scope
This JEP covers correlation only — lease_id, trace_id,
and span_id are propagated as log fields and Prometheus exemplar keys so that
metrics, logs, and (future) traces can be joined. Full distributed tracing
(span creation, sampling policies, trace storage and visualization) is deferred
to a future JEP. Optional propagation of traceparent and lease
identifiers in gRPC metadata remains backward compatible (unknown
metadata ignored by older servers).
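For reference, extracting the correlation identifiers from a W3C `traceparent` value is straightforward (a sketch; the header layout is `version-trace_id-span_id-flags`):

```python
def parse_traceparent(traceparent: str) -> tuple:
    """Split a W3C traceparent header into (trace_id, span_id)."""
    # Example: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    _version, trace_id, span_id, _flags = traceparent.split("-")
    return trace_id, span_id


trace_id, span_id = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```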
## Hardware Considerations
No hardware considerations.
## Design Decisions

### DD-1: How lease-scoped context metadata is stored
Scope: This decision is about where to store generic metadata on a
Lease that describes why a run exists or where it came from — for example
an external build id, pipeline id, VCS revision, or other
operator-defined keys (team, environment), within the cardinality and
size limits defined in Cardinality guidelines. The same stored context
is the intended source to propagate (where safe) into metric series
labels and into log line fields for emissions that occur during the
lease and for logs produced during client access to the platform
(for example jmp) or during exporter and control-plane handling, so
Prometheus and Loki can correlate on one lease-level
identity without re-typing it on every line.
Alternatives considered:
1. Annotation and label only on the `Lease` object — Kube-native, no spec change; limited size for annotations; labels for select queries only.
2. Typed subfields under `spec` (for example `observability` or `context`) — easier validation, clearer API, migration path in CRD.
3. Only client-side (environment / local config) — no cluster visibility; hard for operators to audit; no stable object-level link to per-lease metrics and server logs in the cluster.
Decision: (2) — a typed spec.context map under the Lease CRD for
first-class, validated context. (1) (labels/annotations) remains allowed
for integration with generic tooling that only understands Kubernetes metadata
or benefits from lease label filtering.
Rationale: Typed fields make validation and documentation clear; labels are still useful for selection and for tools that only understand metadata.
### DD-2: Where operational events (flash, image) live
Alternatives considered:
1. Kubernetes `Event` objects — built-in, TTL-limited, good for “what happened” in `kubectl get events` but not long-term history by default.
2. `Lease.status.conditions` only — compact but poor for a sequence of operations with payloads (image id, size).
3. Dedicated CRD (for example per-event or a single stream object) — more design and RBAC, better long-term retention and querying if backed properly.
4. Annotated log events — a lightweight alternative that can be traced and filtered alongside regular logs.
Decision: (4), since the other alternatives add extra pressure on cluster etcd; annotated log events provide the same level of functionality and can be browsed together with regular logs.
Rationale: Annotated log events naturally flow through the Loki
pipeline this JEP already establishes (DD-5, DD-7), so operational
records (flash started, flash failed, image reference) are queryable,
filterable, and correlated with surrounding exporter and controller logs
using the same correlation fields (lease_id, exporter, result, …)
without a second query domain. Kubernetes Event objects (1) have a short
default TTL (~1 h) and still write to etcd on every occurrence;
status.conditions (2) is a poor fit for a sequence of operations with
variable payloads (image digest, byte count, duration); a dedicated CRD
(3) adds schema versioning, RBAC surface, and per-event etcd writes
that scale with flash volume — all pressure the cluster does not need
for data whose primary consumers are dashboards and post-mortem
queries, not reconciliation loops. Structured log events carry arbitrary
fields without CRD migration, support configurable retention in Loki,
and keep the etcd write budget reserved for scheduling and assignment
where it matters most.
### DD-3: Metrics: Prometheus scrape of `/metrics` as the reference path
Alternatives considered:
1. HTTP `GET /metrics` in Prometheus text format (pull) — the default for in-cluster Prometheus in scrape mode; works with the Prometheus Operator (ServiceMonitor), `kube-prometheus`, and self-hosted jobs. The optional Jumpstarter Telemetry service exposes this for aggregated counters it holds after receiving +1 / +N from exporters.
2. Prometheus remote write (or a Mimir / Cortex receiver) from a Jumpstarter component — useful in advanced topologies; not part of the reference implementation in this JEP; operators can add federation or `remote_write` from Prometheus to long-term storage without the application pushing to Prometheus.
3. Both — (1) is required for the documented path; (2) is optional infrastructure behind Prometheus, not a second required app protocol.
4. Reverse scrape via gRPC — exporters maintain a local `prometheus_client.CollectorRegistry` and connect to the Telemetry service via a persistent bidirectional gRPC stream (`MetricsStream`). When Prometheus scrapes the Telemetry service's `/metrics` endpoint, Telemetry fans out scrape requests to all connected exporters, merges the `generate_latest()` responses, and serves the combined result. Controller and Router still expose `/metrics` directly for Prometheus scrape (no change). This avoids push-increment complexity on the wire and keeps full counter state on the exporter at all times.
Decision: (4) — exporter-originated metrics are reverse-scraped
through the Telemetry service via MetricsStream.
Rationale: Exporters are often behind NAT or firewalls and cannot
be directly scraped by Prometheus. The reverse-scrape model (4)
solves this: the exporter initiates an outbound gRPC stream
(NAT-friendly, same direction as the existing controller connection),
the Telemetry service requests metric snapshots on demand, and full
counter state remains on the exporter at all times — eliminating
lost-increment concerns (see DD-9). The exporter uses standard
prometheus_client primitives locally, so driver authors instrument
with familiar counters and histograms. The OpenMetrics exposition
format natively carries exemplars, enabling high-cardinality context
(client, lease_id, and trace_id when present) on individual
samples without additional infrastructure. See DD-6 (no OTel),
DD-7 (Telemetry Deployment), DD-8 (HA replicas).
Exemplar trade-offs and details:
- Wire format. On the OpenMetrics `/metrics` endpoint an exemplar is appended after the sample value:

  ```
  jumpstarter_operations_total{exporter="lab-01",operation="flash",result="success"} 42 # {client="ci-bot",lease_id="abc123",build_id="nightly-42"} 1.0 1625000000.000
  ```

  The `# {key=value,...} value timestamp` suffix is the exemplar. Grafana (≥ 7.4) renders these as clickable dots on metric panels; clicking a dot reveals the attached keys and can link to a Loki log query (filtered by `lease_id`) or a trace view (filtered by `trace_id`).
- Size limit. The OpenMetrics 1.0 spec imposes a 128 UTF-8 character limit on the combined length of exemplar label names and values per exemplar. OpenMetrics 2.0 (experimental, 2026) relaxes this to a soft cap measured in bytes. The exemplar key budget is discussed further in Exemplars for high-cardinality context.
- Sampling. Client libraries rate-limit exemplar updates internally; the last-seen exemplar per series is served on each scrape, not one per data point. For the Jumpstarter use case this is sufficient: the most recent `lease_id` / `trace_id` on a counter is the value operators need when investigating a spike.
- Library support. Go client support is mature (`prometheus/client_golang` ≥ 1.16). The Python `prometheus_client` library is used on the exporter side to maintain local registries and produce `generate_latest()` output for the reverse-scrape path (see API / Protocol Changes). Exemplar support in the Python library is functional but less complete than Go; if limitations arise, exemplar data can be sent as a sidecar field in `MetricsScrapeResponse` for the Telemetry service to merge server-side.
- Infrastructure requirements. Prometheus ≥ 2.26 with `--enable-feature=exemplar-storage` and `--storage.tsdb.max-exemplars` (e.g. 100 000). Grafana ≥ 7.4 for exemplar visualization. Perses does not yet support exemplar rendering; until it does, operators who want exemplar click-through can use Grafana alongside Perses or wait for upstream support.

These limitations are acceptable for the correlation use case this JEP targets.
### DD-4: Log format for services vs CLI

Alternatives considered:

1. JSON always for every process — best for machines; hard for humans.
2. Human text default for `jmp`, JSON for long-running services, plus a CLI push via the Telemetry ingest endpoint in JSON format (in addition to the human-friendly output).
3. Single format with a pretty-printer in front of developers — more moving parts.
Decision: (2). Long-running services (jumpstarter-controller,
jumpstarter-router, jumpstarter-telemetry, Exporter) emit
structured JSON to stdout. The Controller and Router do not
push logs directly to Loki; instead, a cluster-level log shipper
(Promtail, Grafana Alloy, Vector, or equivalent DaemonSet) scrapes their
pod logs and delivers them to Loki. Only jumpstarter-telemetry writes
to Loki directly (push API) because the exporter/client data it
aggregates does not originate as any pod’s stdout.
Rationale: Matches the requirement that clients stay human-readable, and at the same time all services get parseable, joinable log lines. Writing JSON to stdout and relying on the cluster log shipper for Loki delivery decouples the Controller reconciler and Router session handling from Loki availability — a Loki outage does not affect lease operations. The Telemetry service retains a direct Loki-push because it is an isolated workload (DD-7) whose core job is Loki ingest.
Format: JSONL (one JSON object per line), produced by setting
--zap-encoder=json on the existing controller-runtime / Zap logger
(no changes to log call sites — existing logr structured fields become
JSON keys automatically). The ts, level, and msg fields follow
Zap’s default JSON encoder output; application code adds domain fields
via the standard logr WithValues / Info / Error API.
Base fields present in every log line:
| Field | Format | Loki label | Description |
|---|---|---|---|
| `ts` | ISO-8601 | no | Timestamp (Zap default). |
| `level` | Lower-case string (`info`, `error`, …) | no | Log severity (Zap default; Go services map `logr` verbosity levels). |
| `msg` | Free-form string | no | Human-readable message (Zap default). |
| `component` | Fixed enum (`cli`, `controller`, `router`, `telemetry`, `exporter`) | yes | Emitting service. |
| `exporter` | CRD name (when applicable) | yes | Exporter CRD name; bounded by cluster size. |
| `lease_id` | UID string (when applicable) | no | Lease UID (high cardinality). |
| `operation` | String (when applicable) | no | Operation name (flash, power, …). |
| `result` | String (when applicable) | no | Outcome (success, failure, …). |
| `driver_type` | Category from predefined set (when applicable) | no | Driver category (storage, power, …). |
| `client` | CRD name (when applicable) | no | Client CRD name (high cardinality). |
| `spec.context` keys | User-defined strings (during active lease) | no | All `spec.context` entries attached to the active lease. |
| `exporterLabels` keys | Values from Exporter CRD labels (when configured) | no | Operator-defined exporter labels (e.g. `board-type`). |
namespace is emitted by the application from its own runtime
context (the namespace in which the process is running). Log shippers
(Promtail, Grafana Alloy, Vector) may also inject pod and
container from Kubernetes pod metadata via service discovery.
Fields marked as Loki stream labels are extracted by the log shipper
and used as indexed stream selectors. They must be low-cardinality to
keep the active stream count manageable (Grafana recommends < 100 k
active streams per tenant). With the labels above, a deployment with
200 exporters across 5 namespaces produces roughly 1 000 streams —
well within budget. High-cardinality fields like client or
lease_id must stay in the JSON body: promoting client to a
stream label in a 1 000-client, 200-exporter cluster would create
up to 1 000 000 streams, overwhelming the Loki ingester. These fields
are instead queried with | json | client="value" filter
expressions after selecting the relevant streams.
Multi-line content (e.g. stack traces) is embedded as an escaped string
within the JSON value (typically in a stacktrace or error field),
never as bare multi-line text, so each physical line is always one
complete JSON object.
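To make this concrete, an illustrative exporter log line (all values invented) has the following shape; the Python sketch just prints one JSONL record, whereas the real services emit it via Zap:

```python
import json

# Illustrative exporter log line; one JSON object per physical line (JSONL).
line = json.dumps({
    "ts": "2026-05-04T12:00:00Z",
    "level": "info",
    "msg": "flash completed",
    "component": "exporter",    # Loki stream label
    "exporter": "lab-01",       # Loki stream label
    "lease_id": "0b9b4c1e",     # high cardinality: log body only
    "client": "ci-bot",         # high cardinality: log body only
    "operation": "flash",
    "result": "success",
    "driver_type": "storage",
    "build_id": "nightly-42",   # propagated from lease spec.context
})
print(line)
```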
### DD-5: Where Loki and Prometheus (or remote-write) credentials live
Alternatives considered:
1. Each exporter and edge host holds credentials (or a sidecar) to push directly to Loki and to Prometheus (or a metrics gateway) — maximum flexibility; maximum secret distribution and rotation burden on lab and remote sites.
2. Jumpstarter Controller and/or Router receive metrics and structured events from exporters and (optionally) from client traffic they already handle, and forward to the Loki push API and to Prometheus-compatible sinks (scrape registration) with in-cluster auth — one credential surface; enriched with lease, exporter, and client context in one place; must be non-blocking, bounded, and optional so the control path does not depend on Loki or Prometheus availability.
3. Hybrid — generic in-cluster collectors for raw pod logs and scrape; (2) for lease-scoped events and aggregated exporter metrics the platform understands.
4. Dedicated Jumpstarter Telemetry Deployment (see DD-7) instead of folding everything into the Controller — only Telemetry holds Loki-push credentials; isolated failure domain and scaling for reverse-scrape and log ingest. Router and Controller write structured JSON to stdout (see DD-4) and expose `/metrics` for Prometheus scrape; a cluster log shipper delivers their pod logs to Loki without Jumpstarter-specific Loki credentials.
Decision: (4)
Rationale: The goal is to avoid propagating Loki- and
cluster-ingest authentication
to every exporter process while still attaching Jumpstarter-specific
context. Among Jumpstarter components, only jumpstarter-telemetry
holds Loki-push credentials — the Controller and Router have no Loki
client dependency (see DD-4); their pod logs reach Loki via the
cluster’s existing log shipping infrastructure. Generic in-cluster
collectors solve credentials but not semantic correlation unless
integrated; alternative (2)’s trust-model advantage — which (4)
inherits — reuses the existing exporter→controller relationship and
can inject labels and tenant context in one place. A separate
Deployment (4 / DD-7) is preferable to overloading the main
reconciler when load or residency of counters matters.
### DD-6: OpenTelemetry (OTLP / Collector) as a mandated layer
Alternatives considered:
1. Adopt OpenTelemetry — instrument Controller, Router, Exporter, and clients with the OTel SDK, export OTLP to a cluster-local OpenTelemetry Collector, and let the Collector fan out to Loki, Prometheus (remote write), and Tempo.
2. Integrate directly with each backend: Loki HTTP `POST /loki/api/v1/push` or gRPC; Prometheus text on `/metrics`; structured JSON (or logfmt) logs to stdout for shippers; optional W3C `traceparent` in gRPC metadata for correlation without shipping full distributed traces in the first iteration. If traces are ever needed, use Tempo ingest where practical, or a thin sender — still without a project-wide requirement on the OTel SDK in every binary.
3. Hybrid (OTel in one language, direct in another) — lowest common implementation cost but inconsistent contributor experience and two operational models.
Decision: (2). This JEP does not make OpenTelemetry (SDK or
Collector) part of the required reference architecture. Vendors and
operators who already run an OpenTelemetry Collector may scrape the
same /metrics, receive logs shipped by existing agents, or
receive the Loki body the hub would have sent — compatibility
is welcome; dependency is not mandatory.
Rationale:
The proposed Jumpstarter Telemetry service (DD-7) admittedly reimplements a subset of OTel Collector functionality — metric aggregation, log forwarding, backpressure, and multi-replica HA. The decision to build a purpose-built component rather than adopt the OTel Collector rests on three arguments, ordered by importance:
1. Identity enforcement (primary) — The Telemetry service operates inside Jumpstarter's existing authentication and trust domain (mTLS, registered client and exporter identities). It validates that every incoming `MetricsStream` or `PushLogs` call originates from the claimed exporter — preventing impersonation or label injection — using identities the platform already manages. A generic OTel Collector has no awareness of Jumpstarter identities; achieving the same guarantee would require an external auth policy layer (e.g. custom processors, mTLS-to-attribute mapping, and a sidecar or admission webhook to enforce label provenance), adding complexity that offsets the Collector's generality.
2. Operational simplicity — The Telemetry service is a single Go binary with a single config surface (the operator CR), no separate version matrix, and no generic pipeline DSL. An OTel Collector requires operator familiarity with its configuration model (receivers, processors, exporters, and connectors); dual OTel SDK stacks (Go + Python) add version drift and test matrix; and the Collector itself is another versioned service to upgrade and monitor. This overhead is not justified when the data paths are known in advance.
3. Narrow scope — Jumpstarter metrics and lease events map directly to Prometheus and Loki wire protocols that operators already use. Full three-pillar OTel (unified logs and metrics via OTLP) is optional product territory; this JEP optimizes for low ceremony and direct integration with exactly those two backends.
Future extension: because the Telemetry service already aggregates
metrics snapshots and structured log entries in well-defined formats,
adding an OTLP push output (logs and metrics) alongside the existing
Loki and /metrics paths would be a trivial change. This would let
operators route Jumpstarter data into an OTel Collector or any
OTLP-compatible backend without altering the exporter or client side.
The change is additive and does not require adopting the OTel SDK as a
project dependency.
### DD-7: Optional Jumpstarter Telemetry service (dedicated Deployment vs. Controller/Router only)
Alternatives considered:
1. In-process in the Controller (and Router) reconciler — few moving parts; risk of CPU / GC pressure and stronger coupling between leases and high-volume increments or Loki writes.
2. A dedicated in-cluster Service and Deployment (working name `jumpstarter-telemetry`, TBD) that receives gRPC/HTTP increments from exporters and clients, applies them to counters in memory, POSTs to Loki, exposes `/metrics`, and uses the same K8s ServiceAccount / mTLS as other control-plane binaries.
3. Split into separate sidecars (Loki-only, metrics-only) — more images to build and version.
4. Dedicated Deployment with reverse-scrape for metrics and push for logs — the same dedicated `jumpstarter-telemetry` Deployment as (2), but instead of receiving increment RPCs the service reverse-scrapes connected exporters via `MetricsStream` (see API / Protocol Changes). Exporters maintain local `prometheus_client` registries; the Telemetry service requests `generate_latest()` snapshots on demand when its `/metrics` endpoint is hit, merges the results, and serves them to Prometheus. Logs and events are still pushed by exporters and clients via `PushLogs`. Client-side metrics are not collected — all metrically-interesting operations are observable from the exporter side.
Decision: Prefer (4) for the optional aggregated-metrics + Loki path at scale; allow (1) in small or dev clusters; (3) only if review shows a need. In deployments without Loki, the Telemetry service’s own pod logs (structured JSON to stdout) still provide a centralized, queryable event source via the cluster log shipper.
Rationale: A dedicated workload can scale and restart independently; Loki spikes and ingest load cannot starve lease reconciliation in the controller. The reverse-scrape model (4) is preferred over the increment-push model (2) because full counter state stays on the exporter — no metrics are lost when the Telemetry service restarts or is temporarily unavailable, and idempotency concerns are eliminated (see DD-9).
The service has exactly two core responsibilities: (a) reverse-scraping exporter metrics and aggregating them for Prometheus, and (b) ingesting structured logs from exporters and clients with backpressure management and forwarding them to Loki. Everything else (HA, identity enforcement, configuration) is supporting detail, not an independent responsibility.
Identity enforcement: The Telemetry service validates the source
identity of every MetricsStream connection and PushLogs RPC from
the mTLS certificate or ServiceAccount token. The exporter and
client labels on incoming data are enforced server-side to match the
authenticated identity — a compromised or misconfigured exporter
cannot submit metrics under another exporter’s name or inject
arbitrary labels.
Failure modes:
| Scenario | Behavior |
|---|---|
| Telemetry service unavailable | Exporters keep counting locally; no metrics are lost. When the exporter reconnects, the next scrape returns the full current counter state. Log push RPCs are fire-and-forget with bounded retry; log entries may be lost but device operations are unaffected. |
| Telemetry pod restart | Metric state is rebuilt on the next scrape from each connected exporter — no permanent data loss. Prometheus may observe a brief gap until exporters reconnect. |
| Loki unreachable | The Telemetry service buffers log entries in a bounded queue (see Backpressure in the Control-plane aggregation section). On overflow, entries are dropped and `jumpstarter_telemetry_dropped_total` is incremented for alerting. |
| Prometheus scrape fails | No data loss — the next successful scrape triggers a fresh fan-out to connected exporters and returns current values. |
The Telemetry service exposes /healthz (liveness) and /readyz
(readiness, gated on Loki reachability and at least one connected
exporter) endpoints for Kubernetes probes.
Scrape fan-out: When Prometheus hits /metrics, the Telemetry
service fans out MetricsScrapeRequest to all connected exporters in
parallel and waits up to spec.telemetry.metrics.scrapeTimeout
(default: 7 s) for responses. Only metrics received during the
current fan-out are included in the response. Exporters that do not
respond in time are omitted entirely — no cached or stale data is
ever served. This eliminates any risk of double-counting from stale
connections where the exporter may have already migrated to another
replica (see DD-8).
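A hedged sketch of this fan-out (the real service is Go; Python with a hypothetical `request_snapshot()` stream wrapper is used here for brevity):

```python
import asyncio

SCRAPE_TIMEOUT = 7.0  # spec.telemetry.metrics.scrapeTimeout default


async def scrape_fanout(connections: dict) -> bytes:
    """Fan out one scrape to all connected exporters; omit late responders.

    `connections` maps exporter name -> stream handle; request_snapshot()
    is a hypothetical wrapper around one MetricsScrapeRequest/Response
    round-trip on that exporter's MetricsStream.
    """

    async def scrape_one(name, conn):
        try:
            return await asyncio.wait_for(conn.request_snapshot(), SCRAPE_TIMEOUT)
        except asyncio.TimeoutError:
            # Omitted entirely: no cached or stale data is ever served. This is
            # where jumpstarter_scrape_timeouts_total{exporter=name} would grow.
            return None

    snapshots = await asyncio.gather(
        *(scrape_one(name, conn) for name, conn in connections.items())
    )
    # Naive merge for illustration: concatenate the snapshots received during
    # this fan-out (the real service merges metric families properly).
    return b"".join(s for s in snapshots if s is not None)
```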
Memory budget: During a scrape fan-out the Telemetry service
temporarily holds metric snapshots from responding exporters until the
merged response is written to Prometheus. With 200 exporters each
producing ~50 series (bounded by {operation, result, driver_type}
label combinations), the peak is ~10 000 series at ~200–300 bytes
each, costing ~2–3 MB. Snapshots are discarded as soon as the
/metrics response is flushed — no metric data is retained between
scrapes.
### DD-8: Multiple Telemetry replicas (HA) and persistent exporter connections
Context: With the reverse-scrape model (see DD-3 alternative 4
and API / Protocol Changes), the Telemetry service does not hold
authoritative counter state — exporters maintain their own local
prometheus_client registries. The Telemetry service only caches the
latest metric snapshot per exporter. Each exporter opens a single
long-lived MetricsStream to one Telemetry replica.
Alternatives considered:
1. Single replica for Telemetry — no cross-pod `sum` issue; SPOF for ingest and scrape of that Service.
2. Multiple replicas behind a load balancer; each RPC updates one pod, which only advances its partial counters for the label sets it has seen. Prometheus scrapes all pods (or separate `PodMonitor` targets). In PromQL, `sum by (exporter, operation, result, driver_type) (…)` after dropping `pod`/`instance` matches the global total, as long as each real event is applied at most once in the system (counters are additive; increments are partitioned by traffic).
3. Strong consistency (Raft, Redis as source of truth for counters) — higher operating cost than this JEP's v1 scope.
4. Multiple replicas with persistent exporter connections — each exporter opens a single long-lived `MetricsStream` to one replica (persistent by stream). Each replica only caches metric snapshots for its connected exporters. Prometheus scrapes all replicas (via `PodMonitor`); `sum by (exporter, operation, result, driver_type) (…)` after dropping `pod`/`instance` yields the exact global total with no double-counting, because each exporter's metrics appear on exactly one replica's `/metrics` output. On replica failure the exporter reconnects to a survivor and the next scrape returns its full current counter state — no data is lost.
Decision: (4)
Rationale: Persistent exporter connections naturally partition metric
snapshots across replicas with no overlap, so sum across replicas
is exact and double-counting is impossible. Full counter state lives
on the exporter, not on the Telemetry service, so replica restarts
or failovers cause no data loss. Loki log pushes (PushLogs) are
naturally per-replica as well and do not require deduplication.
Alternative (3) adds operational complexity with no benefit given
the reverse-scrape model.
### DD-9: Idempotency vs. best-effort
Context: With the reverse-scrape model, metrics idempotency is a
non-issue — each scrape returns the full current counter state from the
exporter, so there are no increments to deduplicate or double-count.
The only remaining idempotency concern is for PushLogs RPCs, where
a retry could result in duplicate log entries in Loki.
Alternatives considered:
1. Idempotent log pushes (deduplication keys per `LogEntry`) — appropriate for billing- or SLO-sensitive log pipelines; requires a dedup store or Loki-side dedup.
2. Best effort (at-least-once) for `PushLogs` without global deduplication — simpler; rare duplicate log entries on retries.
3. Metrics idempotency (dedup keys on metric increments) — no longer applicable; the reverse-scrape model returns full state, making increment deduplication moot.
Decision: (2) for PushLogs; metrics idempotency is not needed.
Rationale: Duplicate log entries from occasional retries are acceptable for informative/diagnostic logs. Loki queries are tolerant of rare duplicates. No global dedup store is needed in v1; operators treat these logs as diagnostic signals, not audit trails.
### DD-10: Perses over Grafana for dashboarding

Alternatives considered:

1. Grafana — mature, widely deployed, massive plugin and datasource ecosystem; governed by Grafana Labs (commercial); AGPL v3 license; custom JSON dashboard format; external to Kubernetes architecture.
2. Perses — CNCF project (vendor-neutral governance); Apache 2.0 license; standardized dashboard spec (CUE/JSON) with built-in static validation and SDKs for GitOps; Kubernetes-native (CRD support for dashboards-as-code); data-source focus on Prometheus, Loki, and Tempo — exactly the backends this JEP targets.
Decision: (2)
Rationale:
- License alignment — Jumpstarter is Apache 2.0; recommending an AGPL-licensed dashboard layer introduces license friction for downstream distributors and embedders.
- CNCF governance — vendor-neutral stewardship matches the project's open-source posture; no single-vendor control over the dashboard layer.
- Kubernetes-native CRDs — dashboards can be managed as K8s resources, fitting the same declarative, reconciler-driven model Jumpstarter already uses for Leases, Exporters, and the optional Telemetry Deployment.
- GitOps and validation — CUE-based specs with static validation and SDKs enable dashboard-as-code in CI pipelines, consistent with the JEP's emphasis on automation and CI integration.
- Backend focus — Perses targets Prometheus, Loki, and Tempo — exactly the three backends this JEP standardizes on — without carrying the cost of a broad plugin ecosystem the project does not need.
Perses vs Grafana — practical comparison:
| Aspect | Perses | Grafana |
|---|---|---|
| License | Apache 2.0 | AGPL v3 |
| Governance | CNCF (vendor-neutral) | Grafana Labs (commercial) |
| Dashboard-as-code | CUE/JSON spec, static validation, SDKs | JSON export, no built-in validation |
| K8s-native CRDs | Yes | Via third-party operator (grafana-operator) |
| Exemplar rendering | Not yet (upstream roadmap) | Yes (≥ 7.4) |
| Data-source scope | Prometheus, Loki, Tempo | Broad plugin ecosystem |
| Maturity / ecosystem | Early (CNCF sandbox/incubating) | Mature, widely deployed |
The main Perses gap today is exemplar visualization. Operators who need
exemplar overlays on dashboards should use Grafana alongside Perses or
wait for upstream support. Grafana remains fully compatible — all
/metrics and Loki endpoints are standard — so the choice is
non-exclusive.
Operators who prefer Grafana can still point it at the same /metrics and Loki
endpoints; this DD only governs the recommended dashboard experience.
## Design Details

### Correlation and fields
Subject to review — names and cardinality rules should be fixed before “Implemented”.
| Field / label | Prom label | Prom exemplar | Loki stream | Log line | Notes |
|---|---|---|---|---|---|
| `exporter` | yes | — | yes | yes | CRD name; bounded by cluster size. |
| `operation` | yes | — | no | yes | Small fixed enum (flash, power, …). |
| `result` | yes | — | no | yes | Small fixed enum (success, failure, …). |
| `driver_type` | yes | — | no | yes | Category from a predefined set in core (storage, power, …). |
| `error_type` | yes | — | no | yes | Failure class (timeout, device_error, …); on errors. |
| `direction` | yes | — | no | yes | tx / rx; for byte-counter and stream metrics only. |
| `component` | no | — | yes | yes | Fixed set (cli, controller, router, telemetry, exporter). |
| `namespace` | no | — | yes | yes | K8s namespace; bounded. |
| `lease_id` | no | yes | no | yes | Unbounded; exemplar for drill-down. |
| `client` | no | yes | no | yes | CRD name; exemplar for client identity. |
| `spec.context` keys | no | yes | no | yes | From `spec.context` (e.g. `build_id`); exemplar when allowlisted. |
| `trace_id` | no | yes | no | yes | W3C; links metrics to traces via exemplars. |
| `exporterLabels` keys | no | yes | no | yes | From Exporter CRD labels; included when listed in `exemplarKeys`. |
Additional lease.spec.context correlation fields can be added at runtime;
they appear as structured log line fields and, when listed in the operator’s
exemplarKeys allowlist, as Prometheus exemplar keys (see Exemplars for
high-cardinality context below and Operator configuration).
### Cardinality guidelines
Unbounded identifiers (lease_id, client, image_digest, trace_id, and
any operator-defined spec.context keys) must not be used as Prometheus metric
labels or Loki stream labels. They belong inside structured log line JSON
and Prometheus exemplars (see below), where Loki filter expressions
(| json | lease_id = "…") and dashboard exemplar overlays can surface them
without inflating the label index or TSDB series count.
Rules of thumb for this JEP:
- Prometheus labels: each metric label dimension should have < 100 distinct values per scrape target. The label set for Jumpstarter metrics is `{exporter, operation, result, driver_type}` — all bounded enums. `error_type` is added on failure-path metrics and `direction` on byte-counter metrics. High-cardinality context is carried via exemplars, not labels.
- Loki: stream labels should be a small fixed set (`{component, exporter, namespace}`) to keep active stream count per tenant manageable (Grafana's guidance: < 100 k active streams). High-cardinality fields go inside the log line body.
- Lease context fields from `spec.context` are propagated into log line JSON and, when listed in `exemplarKeys`, into Prometheus exemplars. They never become Prometheus labels or Loki stream labels.
### Exemplars for high-cardinality context
Prometheus exemplars attach arbitrary key-value pairs to individual counter
increments and histogram observations without creating new time series. This
is the primary mechanism this JEP uses to surface per-request context
(client, lease_id, and trace_id when present) on metrics while keeping series cardinality
flat.
Default exemplar keys emitted on every counter/histogram observation:
| Key | Source | Purpose |
|---|---|---|
| `client` | Client CRD name | “Which client caused this spike?” |
| `lease_id` | Lease UID | Correlate a metric sample with lease logs. |
| `trace_id` | W3C `traceparent` | Included only when present in gRPC metadata. |
trace_id is not synthesized by Jumpstarter — it is included only when
an external caller (CI pipeline, user code) propagates a traceparent.
Full distributed tracing (spans, storage, visualization) is deferred to
a future JEP; when it lands, trace_id becomes a default key. Until
then, omitting it saves ~45 characters of exemplar budget.
spec.context keys (e.g. build_id, image_digest) are included as
exemplar keys when listed in the operator’s exemplarKeys allowlist (see
Operator configuration). Because exemplars are per-observation metadata —
not label dimensions — they have zero impact on series cardinality regardless
of how many distinct values appear.
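For illustration, attaching these keys with the Python client looks roughly as follows; the `exemplar=` argument is real `prometheus_client` API, exemplars only appear on the OpenMetrics exposition used by the reverse-scrape path, and the library rejects exemplars over the 128-character budget:

```python
from prometheus_client import CollectorRegistry, Counter
from prometheus_client.openmetrics.exposition import generate_latest

registry = CollectorRegistry()
ops = Counter(
    "jumpstarter_operations",
    "Total operations performed.",
    ["exporter", "operation", "result", "driver_type"],
    registry=registry,
)

# Exemplar keys are per-observation metadata: no new time series are created.
# build_id is included assuming it is on the operator's exemplarKeys allowlist.
ops.labels("lab-01", "flash", "success", "storage").inc(
    1, exemplar={"client": "ci-bot", "lease_id": "abc123", "build_id": "nightly-42"}
)

# The exemplar is serialized after the sample on the OpenMetrics output.
print(generate_latest(registry).decode())
```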
Exemplar size budget: The OpenMetrics 1.0 limit is 128 UTF-8
characters for the combined key-value pairs in a single exemplar.
The two default keys (client, lease_id) consume roughly 30–50
characters, leaving ~80–100 characters for spec.context entries
(or more when trace_id is absent). To stay within budget:
- Keys are added in the order specified by the operator's `exemplarKeys` allowlist (see Operator CRD fields below). The default list starts with `client`, `lease_id`; `trace_id` is added when present in the request context.
- Remaining `spec.context` entries are appended in allowlist order until the 128-char limit is reached; keys that do not fit are silently dropped from the exemplar (they remain available in structured log lines). This gives the operator full control over which keys are prioritized when space is tight.
- The `Lease` CRD validates `spec.context` at admission time: key names are limited to 32 characters, values to 64 characters, and the total number of entries to 8. This prevents accidental budget exhaustion and ensures exemplar truncation is rare in practice.
Dashboard visualization: when exemplars are enabled on a Prometheus data
source, metric panels render clickable dots on each sample that carries
exemplar data. Clicking a dot reveals the attached keys and can link to
Loki log queries (filtered by lease_id) or a Tempo trace view (filtered
by trace_id).
Per-client analysis remains available via LogQL for operators who do not
use exemplars:

```logql
sum by (client) (count_over_time({component="exporter"} | json | operation="flash" [5m]))
```
### Proposed metrics
Names are illustrative; final naming should follow Prometheus naming conventions and be fixed before “Implemented”.
| Metric name | Type | Labels | Description |
|---|---|---|---|
| `jumpstarter_operations_total` | counter | `exporter, operation, result, driver_type` | Total operations performed. |
| `jumpstarter_operation_duration_seconds` | histogram | `exporter, operation, driver_type` | Duration of each operation. |
| `jumpstarter_operation_errors_total` | counter | `exporter, operation, driver_type, error_type` | Errors by class (timeout, device, …). |
| `jumpstarter_stream_bytes_total` | counter | `exporter, driver_type, direction` | Bytes transferred (tx/rx) on streams. |
| `jumpstarter_active_sessions` | gauge | `exporter` | Currently active lease sessions. |
| `jumpstarter_lease_acquisitions_total` | counter | `result` | Lease acquire attempts (controller). |
| `jumpstarter_telemetry_dropped_total` | counter | `destination` | Log entries dropped due to backpressure (e.g. `destination="loki"`). |
| `jumpstarter_scrape_timeouts_total` | counter | `exporter` | Scrape fan-out timeouts per exporter (Telemetry-side). Each timeout also emits a structured log entry. |
All counters and histograms carry exemplar keys from the operator’s
exemplarKeys allowlist (by default client and lease_id; trace_id
when present; spec.context and exporterLabels entries when listed)
on every observation.
### Metric usage and alerting
| Metric | Primary use | Alert? | Starter threshold |
|---|---|---|---|
| `jumpstarter_operations_total` | Dashboard | yes | Failure rate > 20 % over 15 min per exporter. |
| `jumpstarter_operation_duration_seconds` | Dashboard | yes | p95 > 60 s per operation type. |
| `jumpstarter_operation_errors_total` | Dashboard | yes | Error rate rising; group by `error_type`. |
| `jumpstarter_stream_bytes_total` | Dashboard | no | — |
| `jumpstarter_active_sessions` | Dashboard | yes | 0 sessions for > 30 min (possible exporter issue). |
| `jumpstarter_lease_acquisitions_total` | Dashboard | yes | Failure rate > 10 % over 15 min. |
| `jumpstarter_telemetry_dropped_total` | Alerting | yes | Any increment (telemetry pipeline saturated). |
| `jumpstarter_scrape_timeouts_total` | Alerting | yes | Repeated timeouts for same exporter (connectivity or load issue). |
Thresholds are suggestions; operators should tune them to their
environment. The operator should ship a set of example PrometheusRule
CRDs based on the table above that operators can enable and customize.
These rules are opt-in and disabled by default to avoid noise in
environments with different baselines.
High-frequency byte counters: jumpstarter_stream_bytes_total can
be incremented at very high rates on serial and video streams. Because
metrics live in the exporter’s local prometheus_client registry, high
update rates do not generate any RPC traffic — the counter is updated
in-process and only serialized when the Telemetry service sends a
MetricsScrapeRequest.
### Example queries

#### PromQL (Prometheus)
Flash failure rate per exporter:

```promql
sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[5m]))
/
sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[5m]))
```

p95 flash duration per driver type:

```promql
histogram_quantile(0.95,
  sum by (driver_type, le) (rate(jumpstarter_operation_duration_seconds_bucket{operation="flash"}[5m]))
)
```

Top 5 busiest exporters (all operations, 1 h window):

```promql
topk(5, sum by (exporter) (rate(jumpstarter_operations_total[1h])))
```

Alert: exporter flash failure rate > 20 % over 15 min:

```promql
(
  sum by (exporter) (rate(jumpstarter_operations_total{operation="flash", result="failure"}[15m]))
  /
  sum by (exporter) (rate(jumpstarter_operations_total{operation="flash"}[15m]))
) > 0.2
```

Error breakdown by class for a specific driver:

```promql
sum by (error_type) (rate(jumpstarter_operation_errors_total{driver_type="storage"}[1h]))
```

Bytes per second by exporter and direction:

```promql
sum by (exporter, direction) (rate(jumpstarter_stream_bytes_total[5m]))
```

Exporters with repeated scrape timeouts (last 30 min):

```promql
topk(10, sum by (exporter) (increase(jumpstarter_scrape_timeouts_total[30m])))
```

HA Telemetry: aggregate across replicas (drop pod/instance):

```promql
sum by (exporter, operation, result, driver_type) (rate(jumpstarter_operations_total[5m]))
```
#### LogQL (Loki)

All flash events for a specific lease:

```logql
{component="exporter"} | json | operation="flash" | lease_id="<uid>"
```

Flash failures per client over 5 min (log-based, no exemplars needed):

```logql
sum by (client) (
  count_over_time({component="exporter"} | json | operation="flash" | result="failure" [5m])
)
```

Controller logs for a specific lease (post-mortem):

```logql
{component="controller"} | json | lease_id="<uid>"
```

Error events across all exporters in a namespace:

```logql
{component="exporter", namespace="production"} | json | result="failure"
```

Telemetry service health (its own operational logs):

```logql
{component="telemetry"} | json | level="error"
```
### Control-plane aggregation (Controller / Router / optional Telemetry)
When this mode is enabled in a deployment:
- Exporters maintain local `prometheus_client` registries and open a `MetricsStream` to the optional `jumpstarter-telemetry` service (DD-7). On each Prometheus scrape the Telemetry service fans out `MetricsScrapeRequest` to all connected exporters in parallel, merges the responses, and serves the combined output on `/metrics` (DD-3). HA (multiple replicas with persistent exporter connections) uses `sum` in PromQL (DD-8). Exporter and edge processes never need Loki or cluster-scrape credentials directly (DD-5).
- Exporters and clients (`jmp`) push structured log entries to the Telemetry service via `PushLogs`. The Telemetry service forwards these to Loki. Best-effort duplicate tolerance applies (DD-9).
- Controller and Router emit structured JSON logs to stdout (see DD-4). They do not push logs directly to Loki; a cluster-level log shipper (Promtail, Grafana Alloy, Vector, or equivalent) scrapes their pod logs and delivers them to Loki. This decouples the reconciler and session-handling hot paths from Loki availability.
- Backpressure: The Telemetry service uses a bounded ring buffer for the Loki log push path with a configurable depth (default: 10 000 entries, see `spec.telemetry.backpressure.queueDepth`). On overflow, dropped entries are replaced by a single drop marker — a standard `LogEntry` with `severity="warning"`, `component="telemetry"`, `operation="backpressure"`, and the drop count and time window placed in `extra_fields` (`{"count":"142","window_seconds":"12"}`). Subsequent drops while the buffer is still full accumulate into the same marker rather than adding new entries, so the queue always retains one slot for the current drop summary. Because the marker is a regular `LogEntry`, consumers do not need special-case parsing to detect or exclude it (see the sketch after this list). A `jumpstarter_telemetry_dropped_total` counter (partitioned by `destination={loki}`) is also incremented on `/metrics` for alerting. Metrics do not need backpressure — the reverse-scrape model is pull-based and transient (no buffering between scrapes). Because the Controller and Router do not push to Loki, their lease/session operations are inherently isolated from Loki slowdowns.
- Multi-tenancy: write-side tenant scoping (e.g. namespace-based separation in Loki and Prometheus) is a deployment concern handled by the log shipper and Prometheus configuration. Read-side access control (who can query which metrics or logs) is likewise a deployment concern and out of scope for this JEP.
- Metric facts originate on the exporter (local `prometheus_client` counters/histograms); the Telemetry service is a transparent scrape-aggregation proxy. Controller and Router expose their own `/metrics` for Prometheus scrape and rely on the log shipper for their stdout logs.
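A minimal sketch of the drop-marker behavior described above (queue logic only; the marker fields follow the `LogEntry` message, everything else is illustrative):

```python
from collections import deque

QUEUE_DEPTH = 10_000  # spec.telemetry.backpressure.queueDepth default

queue: deque = deque()   # bounded Loki push queue
drop_marker = None       # at most one marker summarizes the current drop window


def enqueue(entry: dict) -> None:
    """Append a LogEntry-shaped dict, keeping one slot for a drop summary."""
    global drop_marker
    if len(queue) < QUEUE_DEPTH - 1:
        queue.append(entry)
        drop_marker = None  # queue drained below the cap: close the drop window
        return
    if drop_marker is None:
        # First overflow: a single, regular LogEntry stands in for the drops,
        # so consumers need no special-case parsing to detect it.
        drop_marker = {
            "severity": "warning",
            "component": "telemetry",
            "operation": "backpressure",
            "extra_fields": {"count": "0", "window_seconds": "0"},
        }
        queue.append(drop_marker)
    # Subsequent drops accumulate into the same marker instead of new entries.
    # (window_seconds bookkeeping is omitted in this sketch.)
    fields = drop_marker["extra_fields"]
    fields["count"] = str(int(fields["count"]) + 1)
```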
### High-level data flow

#### Client (jmp)
```mermaid
flowchart LR
  jmp([jmp CLI]) -->|session gRPC| exp[Exporter]
  jmp -->|structured logs| tel[jumpstarter-telemetry]
```
The CLI connects to the Exporter for device sessions and sends structured logs to the Telemetry service for Loki ingest (see DD-4).
#### Exporter

```mermaid
flowchart LR
  ctrl[jumpstarter-controller] -->|lease lifecycle| exp[Exporter]
  drv[Drivers] --> exp
  exp <-->|MetricsStream| tel[jumpstarter-telemetry]
  exp -->|PushLogs| tel
```
The Controller assigns leases; the Exporter delegates to Drivers and
maintains local prometheus_client counters. It opens a MetricsStream
to Telemetry for reverse-scrape and pushes structured logs via PushLogs
(see DD-2, DD-3, DD-5, DD-7).
#### Telemetry to backends

```mermaid
flowchart LR
  prom[(Prometheus)] -->|scrape /metrics| tel[jumpstarter-telemetry]
  tel <-->|MetricsStream fan-out| exp[Exporters]
  tel -->|push API| loki[(Loki)]
  tel -->|JSON stdout| shipper[Log shipper]
  shipper -->|pod logs| loki
```
On each Prometheus scrape, Telemetry fans out MetricsScrapeRequest to
all connected exporters in parallel, merges responses, and serves the
combined output. Logs received via PushLogs are forwarded to Loki
(DD-3, DD-7, DD-8).
#### Controller to backends

```mermaid
flowchart LR
  ctrl[jumpstarter-controller] -->|JSON stdout| shipper[Log shipper]
  shipper -->|pod logs| loki[(Loki)]
  ctrl -->|/metrics| prom[(Prometheus)]
```
The Controller writes structured JSON to stdout (see DD-4). A
cluster log shipper scrapes pod logs and delivers them to Loki. The
Controller exposes /metrics for reconciliation and lease-level counters.
#### Router to backends

```mermaid
flowchart LR
  router[jumpstarter-router] -->|JSON stdout| shipper[Log shipper]
  shipper -->|pod logs| loki[(Loki)]
  router -->|/metrics| prom[(Prometheus)]
```
The Router writes structured JSON to stdout (see DD-4). A
cluster log shipper scrapes pod logs and delivers them to Loki. The
Router exposes /metrics for routing and session-level counters.
The diagrams above summarize the reverse-scrape hub model described in Control-plane aggregation. For credential isolation see DD-5; for the Telemetry Deployment see DD-7; for HA with persistent exporter connections see DD-8; for best-effort log semantics see DD-9. No OpenTelemetry Collector is required (see DD-6); operators may run one alongside and scrape the same targets if they choose.
Common open-source backends (direct integration; no mandatory OTel)¶
This JEP’s target wire protocols and components are Prometheus and
Loki (and, if trace export is ever added, Tempo or Jaeger with
native ingest or HTTP — not OTLP as a Jumpstarter requirement; see
DD-6). OpenTelemetry is a parallel ecosystem: teams can run a
Collector next to Jumpstarter and still scrape /metrics and ship
logs with Promtail-class agents; the reference design does not depend
on the OTel SDK in application code.
Prometheus for metrics (and Alertmanager for routing alerts): scrape the `/metrics` endpoint, remote-write to a long-term store if needed, and drive dashboards in Perses or self-hosted UIs (see DD-10). `kube-state-metrics` and the Prometheus Operator are common in Kubernetes; vendors often package the same projects, but this JEP refers to the open-source components by name.
Loki (Grafana Labs, AGPL) for log storage and querying; it pairs with Perses (see DD-10) for search and with Promtail, Grafana Agent, or Grafana Alloy to ship logs, or with application push to Loki's HTTP API as already discussed in the control-plane path.
Traces (optional, future work) — if adopted, Grafana Tempo and Jaeger are typical stores; use W3C Trace Context in RPC metadata for correlation even when full trace export is off. OTLP is only a convenience for operators; it is not a core dependency of this JEP.
A typical Kubernetes integration path: `ServiceMonitor` + Prometheus (or a compatible remote-write consumer) and a Loki endpoint for logs — any EKS, GKE, AKS, self-managed Kubernetes, or bare-metal install that runs these same projects can be the target; the implementation plan should name tested combinations (Prometheus and Loki version pairs where relevant) in Implementation History, not a single product bundle.
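As an illustration of that path, a `ServiceMonitor` for the Telemetry service might look like the following; the namespace, labels, and port name are assumptions, and `spec.telemetry.serviceMonitor` would generate an equivalent object:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jumpstarter-telemetry
  namespace: jumpstarter-lab                            # assumed namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: jumpstarter-telemetry    # assumed Service label
  endpoints:
    - port: metrics                                     # assumed port name
      path: /metrics
      interval: 30s
```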
Operator configuration¶
The Jumpstarter operator CR controls telemetry behavior cluster-wide.
Observability settings live under spec.telemetry so that administrators
can tune metrics, logging, and exemplar behavior without editing code.
Key configurable fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Deploy the optional Telemetry service. |
| `loki.url` | string | — | Loki push endpoint; optional — Telemetry can run metrics-only without Loki. |
| `loki.secretRef` | string | — | Secret with Loki credentials (see DD-5). |
| `loki.tls.caSecretRef` | string | — | Secret containing a CA bundle (`ca.crt` key) to trust when connecting to Loki. |
| `loki.tls.insecureSkipVerify` | bool | `false` | Disable TLS certificate verification (development/testing only). |
| `exporterLabels` | []string | `[]` | Exporter-level label keys (e.g. `board-type`) propagated into structured logs and the exemplar candidate pool. |
| `metrics.exemplarKeys` | []string | — | Allowlist of keys to include in exemplars (including built-in keys such as `client` and `lease_id`). |
| `metrics.driverTypeEnum` | []string | `power`, `storage`, `network`, `serial`, `console`, `video`, `composite` | Allowed `driver_type` label values; drivers outside the list fall back to `other`. |
| `serviceMonitor` | bool | — | Create `ServiceMonitor` objects for Prometheus Operator scraping. |
| `prometheusRules` | bool | — | Deploy starter `PrometheusRule` alert rules. |
| `scrapeTimeout` | duration | — | Max time to wait for parallel exporter responses during a `/metrics` scrape. |
| `backpressure.queueDepth` | int | `10000` | Ring buffer depth for the Loki log push queue. |
Example CR snippet:
```yaml
apiVersion: operator.jumpstarter.dev/v1alpha1
kind: Jumpstarter
metadata:
  name: jumpstarter
spec:
  telemetry:
    enabled: true
    exporterLabels:
      - board-type
    logging:
      filter:
        min_severity: warning
    loki:
      url: "https://loki-gateway.monitoring.svc:3100/loki/api/v1/push"
      secretRef: "loki-credentials"
      tls:
        caSecretRef: "loki-ca-bundle"
    metrics:
      exemplarKeys:
        - client
        - lease_id
        - build_id
        - board-type
      driverTypeEnum:
        - power
        - storage
        - network
        - serial
        - console
        - video
        - composite
    serviceMonitor: true
    prometheusRules: true
    scrapeTimeout: "7s"
    backpressure:
      queueDepth: 20000
```
The driverTypeEnum list acts as an allowlist: drivers must select a
category from this set (or fall back to other). This keeps the
driver_type Prometheus label bounded and prevents cardinality
surprises from third-party drivers. Administrators can extend the list
for site-specific driver categories.
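A minimal sketch of that normalization, assuming the configured enum is handed to the metrics layer (function and variable names are illustrative):

```python
def normalize_driver_type(declared: str, allowlist: frozenset[str]) -> str:
    """Clamp a driver-declared category to the configured enum.

    Anything outside the allowlist collapses to "other", keeping the
    driver_type label bounded regardless of third-party drivers.
    """
    return declared if declared in allowlist else "other"

# With the enum from the CR snippet above:
ALLOWED = frozenset({"power", "storage", "network", "serial",
                     "console", "video", "composite"})
assert normalize_driver_type("storage", ALLOWED) == "storage"
assert normalize_driver_type("quantum-flux", ALLOWED) == "other"
```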
The exporterLabels list names Exporter CRD label keys whose values
are copied into every log JSON field and made available as exemplar
candidates for operations involving that exporter. For example, setting
exporterLabels: ["board-type"] means an Exporter with the label
board-type: rpi4 will include "board-type": "rpi4" in its
structured log lines and in the exemplar candidate pool. The list is
empty by default — no exporter labels are propagated unless the
administrator opts in.
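For example, a structured log line from that exporter might look like the following; only the `board-type` propagation is specified above, and the remaining field names sketch the base field set:

```json
{
  "ts": "2026-05-04T12:00:00Z",
  "severity": "info",
  "component": "exporter",
  "operation": "flash",
  "lease_id": "lease-abc123",
  "board-type": "rpi4",
  "message": "flash completed"
}
```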
The exemplarKeys list is an allowlist that controls which keys are
included in Prometheus exemplars. This filters everything — built-in
keys (client, lease_id), spec.context keys, and exporterLabels
keys alike. Only keys present in exemplarKeys are emitted; unlisted
keys are omitted even if available. This gives administrators full
control over exemplar budget usage: adding board-type to both
exporterLabels and exemplarKeys propagates hardware type into
exemplars, while removing lease_id frees budget for other entries.
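A sketch of exemplar attachment under that allowlist, using `prometheus_client` (the metric name and helper are illustrative; exemplars are only emitted in the OpenMetrics exposition format — see Risks):

```python
from prometheus_client import Counter

# Illustrative metric; real names come from the documented metrics surface.
OPERATIONS = Counter(
    "jumpstarter_driver_operations_total",
    "Driver operations by type and result",
    ["driver_type", "result"],
)

def record_operation(result: str, context: dict[str, str],
                     exemplar_keys: frozenset[str]) -> None:
    # Only allowlisted keys become exemplar labels; everything else
    # remains available in log lines instead of burning exemplar budget.
    exemplar = {k: v for k, v in context.items() if k in exemplar_keys}
    OPERATIONS.labels(driver_type="storage", result=result).inc(
        1, exemplar=exemplar or None
    )

record_operation(
    "success",
    {"client": "ci-bot", "lease_id": "lease-abc123", "build_id": "b42"},
    exemplar_keys=frozenset({"client", "lease_id"}),
)
```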
Loki transport: During implementation, evaluate whether the Telemetry
service should connect to Loki via the HTTP push API
(/loki/api/v1/push) or the gRPC endpoint. gRPC may offer better
throughput and streaming semantics (aligned with Jumpstarter’s existing
gRPC infrastructure), while the HTTP API is simpler to debug and more
broadly supported by Loki-compatible backends. The spec.telemetry.loki.url
field should accept either scheme (http:// / grpc://) so the choice
remains a deployment decision.
Loki TLS: Many deployments terminate Loki behind a TLS endpoint
with an internal or self-signed CA. The spec.telemetry.loki.tls
subsection follows the same pattern as the existing operator TLS
configuration: caSecretRef names a Kubernetes Secret whose ca.crt
key contains the PEM-encoded CA bundle to trust. When set, the
Telemetry service adds this CA to its TLS root pool when connecting to
Loki. insecureSkipVerify disables certificate verification entirely
and should only be used in development or testing environments.
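A minimal sketch of the HTTP push path with a custom CA, assuming the `requests` library; the stream labels, log line, and mount path are illustrative:

```python
import json
import time

import requests

def push_to_loki(url: str, lines: list[tuple[dict, str]],
                 ca_path: str | None = None) -> None:
    """Push rendered log lines to Loki's HTTP API (/loki/api/v1/push).

    `lines` pairs a stream-label dict with a JSON-encoded log line;
    `ca_path` points at the ca.crt mounted from loki.tls.caSecretRef.
    """
    now_ns = str(time.time_ns())
    payload = {
        "streams": [
            {"stream": labels, "values": [[now_ns, line]]}
            for labels, line in lines
        ]
    }
    resp = requests.post(
        url,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        verify=ca_path if ca_path else True,  # trust the configured CA bundle
        timeout=5,
    )
    resp.raise_for_status()

push_to_loki(
    "https://loki-gateway.monitoring.svc:3100/loki/api/v1/push",
    [({"component": "exporter"}, '{"severity": "info", "message": "hello"}')],
    ca_path="/etc/jumpstarter/loki-ca/ca.crt",  # illustrative mount path
)
```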
Test Plan¶
Unit Tests¶
Log field builders and redaction: ensure defaults strip secrets; optional fields behind flags.
Metric registration helpers: label validation and naming conventions.
Integration Tests¶
Operator + exporter: scrape or receive metrics; assert presence of a minimal documented set of series after a known operation.
If the control-plane forward path is implemented: with a test Loki and a Prometheus-compatible sink (or mock), assert that records arrive with expected correlation fields (`lease_id`, `exporter`, …) and that exporter pods do not require Loki or cluster-scrape credentials in their spec.
If Telemetry runs with >1 replica: one test verifies that `sum` by business labels (dropping `pod`/`instance`) matches expected totals with persistent exporter connections (see DD-8); an example query follows this list.
Lease with metadata: objects validate; events or status updates match expected structure.
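For illustration, the aggregation under test would have this shape (the metric name is assumed from the examples above):

```promql
sum by (driver_type, result) (jumpstarter_driver_operations_total)
```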
Hardware-in-the-Loop¶
Flashing and power paths: at least one driver records an event and/or metrics counter on success and failure on real hardware in a lab.
Serial and stream paths expose tx/rx byte counts.
Independent testability¶
Each component must be testable in isolation without deploying the full stack:
Structured logging: unit tests validate JSON output format, base fields, and `spec.context` propagation using an in-memory logger — no Loki required.
Exporter metrics: unit tests verify counter/histogram registration, label correctness, and exemplar attachment using a local Prometheus registry — no Telemetry service required (see the sketch after this list).
Telemetry service: integration tests use mock gRPC clients and a mock Loki endpoint to verify ingest, counter aggregation, backpressure behavior, and drop markers — no real exporters required.
Operator configuration: unit tests validate CRD admission (e.g. `spec.context` size limits) and `ServiceMonitor` generation.
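A sketch of the registry-local style of test described above, using `prometheus_client` directly (the metric name is illustrative):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

def test_counter_registration_and_labels():
    # A private registry keeps the test hermetic — no HTTP server,
    # no Telemetry service, no global state shared between tests.
    registry = CollectorRegistry()
    ops = Counter(
        "jumpstarter_driver_operations_total",  # illustrative name
        "Driver operations by type and result",
        ["driver_type", "result"],
        registry=registry,
    )
    ops.labels(driver_type="power", result="success").inc()

    exposition = generate_latest(registry).decode()
    assert 'driver_type="power"' in exposition
    assert 'result="success"' in exposition
```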
End-to-end (CI)¶
The full telemetry pipeline should be exercised in GitHub Actions CI. Evaluate feasibility of running a minimal Prometheus + Loki stack inside the CI environment (e.g. single-binary mode containers); if resource constraints make this impractical, at minimum:
Loki mock or single-binary: a lightweight Loki instance (or a mock HTTP/gRPC endpoint that validates the Loki push API contract) receives logs from the Telemetry service and asserts expected fields, stream labels, and `spec.context` propagation across the full exporter → Telemetry → Loki path.
Prometheus scrape: the existing Go/Ginkgo E2E test suite performs direct HTTP scrapes of the `/metrics` endpoints on Controller, Router, and Telemetry services — no separate Prometheus instance required. The test parses the OpenMetrics response and asserts that documented series, labels, and exemplars appear after a known operation sequence.
Correlation round-trip: an E2E test runs a lease lifecycle (create → flash → power-cycle → release) and verifies that the same `lease_id` and `exporter` values appear in both scraped metrics (label or exemplar) and ingested log entries, confirming cross-signal correlation.
Feasibility of this stack should be evaluated early (Phase 1) so that all subsequent phases have E2E coverage from the start.
Manual¶
`jmp` default output remains readable; JSON structured logs are only sent to jumpstarter-telemetry for general log ingest.
Acceptance Criteria¶
[ ] Exporter (or sidecar) exposes a documented metrics surface; drivers can contribute without each one reimplementing an ad hoc HTTP server.
[ ] Controller and one data-plane service emit structured logs with a documented minimum field set.
[ ] Operator provides a configuration section to enable telemetry, including the endpoint and secret references needed to push logs to Loki.
[ ] Operator attempts to auto-configure Prometheus metric scraping of the documented metrics endpoints.
[ ] A JSON schema (or equivalent machine-readable specification) is published for the structured log format, enabling consumers to validate log entries and detect regressions in field names or types.
[ ] Backward compatibility: existing clients and manifests without the new fields continue to work; deployments that do not use hub forwarding behave as today.
Graduation Criteria¶
Experimental (first release behind flag or doc-only)¶
JEP in Discussion; partial implementation; known gaps listed in Unresolved Questions.
Stable¶
Acceptance criteria met; SLOs for log volume and metric cardinality documented; upgrade notes for the operator and CLI.
Backward Compatibility¶
New CRD fields and labels must be optional; existing lease flows unchanged.
gRPC: new metadata must be additive; servers tolerate missing trace and context fields from older clients; clients ignore unknown fields where applicable.
`AuditStream` removal: the `AuditStream` RPC and `AuditStreamRequest` message on `ControllerService` are removed. This RPC was never implemented or called by any client — a grep across the codebase confirms zero usage outside its protobuf definition. Removing it is a no-op for all existing deployments. The new `PushLogs` RPC on `TelemetryService` supersedes the intended use case.
`LogStreamResponse` enrichment (new optional fields `driver_type`, `operation`, `timestamp`, `structured_fields`) is purely additive and backward-compatible — existing clients ignore unknown fields.
No removal of current default CLI behavior; JSON logging only when selected.
Consequences¶
Positive¶
Operators can route logs and metrics to existing Prometheus, Loki, and Perses-based stacks (self-hosted or platform-managed) without a mandatory OpenTelemetry Collector in front of Jumpstarter (see DD-6, DD-10).
CI can correlate a failed run to equipment and build metadata.
Driver authors get a single pattern for operation counters and event emission.
Security-conscious users can run with minimal log fields and no trace.
Operators can keep Loki, Prometheus, and related API tokens in-cluster only; exporters keep a single Jumpstarter trust relationship (DD-5).
The optional Telemetry service isolates Loki/series work from the reconciler (DD-7, DD-8); Controller and Router carry no Loki client dependency, so a Loki outage cannot affect lease operations (DD-4).
Negative¶
More code paths, dependencies (for example a Prometheus client library, Loki HTTP client, and structured log helpers), and operability and documentation burden.
Operators must run a functioning cluster log shipper (Promtail, Grafana Alloy, Vector, or equivalent) to see Controller and Router logs in Loki. This is near-universal in production Kubernetes but worth documenting for minimal or dev clusters.
Risks¶
High-cardinality metadata accidentally promoted to metric labels could overload the TSDB. The cardinality guidelines restrict labels to bounded enums and route variable context through exemplars and log line fields instead.
Exemplars require the OpenMetrics exposition format and Prometheus >= 2.26 with exemplar storage enabled (on by default since Prometheus 2.39). Operators on older Prometheus versions still get full metrics and logs; exemplar-based drill-down is unavailable until they upgrade.
Prometheus / Loki / Perses-stack version drift in the field — document tested pairs; W3C Trace Context in gRPC remains best-effort across Python and Go (no OTel SDK requirement to propagate `traceparent` where needed).
Rejected Alternatives¶
“All metrics and facts are generated only in the controller” — would miss per-exporter and per-driver truth; rejected. Forwarding exporter-originated series and events through the control-plane (with stable labels) is not the same and remains in scope (see DD-5).
Requiring Loki- and Prometheus-ingest credentials on every exporter and edge as the only supported model — rejected in favor of optional hub forwarding and of cluster-native collectors that also avoid per-host secrets, even though those collectors are not Jumpstarter-specific.
“Mandatory OpenTelemetry SDK and Collector” for all metrics, logs, and traces — rejected for the reference architecture; rationale in DD-6 (optional parallel deployment by operators is still fine).
“Unstructured logs everywhere; parse with regex” — rejected as unscalable for joins with traces and multi-service incidents.
“Mandatory full tracing for every command” — high overhead; rejected; prefer sampling and opt-in for heavy paths.
“Push metric increments from exporters to telemetry” — exporters would send `+1`/`+N` counter increments and histogram observations to the Telemetry service, which would maintain in-memory counters and expose them on `/metrics`. Rejected because: (a) counter state would be lost on Telemetry restart, (b) retries introduce double-counting requiring idempotency logic, and (c) high-frequency counters (e.g. stream bytes) generate excessive RPC traffic. The reverse-scrape model keeps full counter state on the exporter and generates zero RPC traffic between scrapes (see DD-3 alternative 4, DD-7).
“Reuse `AuditStream` for telemetry log push” — `AuditStream` was an unimplemented stub on `ControllerService` with no message schema for structured telemetry data. Rather than retrofitting it, a purpose-built `PushLogs` RPC on the new `TelemetryService` provides a cleaner contract and separates telemetry from the controller's reconciliation API.
Prior Art¶
Prometheus and Alertmanager — time-series metrics and alerting; Prometheus guidance on metric naming and label cardinality; remote write for non-scrape topologies; exemplars for attaching high-cardinality context to individual samples.
Grafana exemplar support — visualizing exemplars in metric panels and linking to traces or logs.
Loki — log aggregation, label model, and push and query APIs; often combined with Perses (see DD-10) and Grafana Agent / Alloy or Promtail for log shipping.
Grafana Tempo or Jaeger — common trace backends (native or HTTP ingest; OTLP where the operator uses it — not a Jumpstarter code dependency; see DD-6).
Perses — CNCF dashboard project; Apache 2.0; Kubernetes-native CRDs; CUE/JSON spec with GitOps SDKs; focused on Prometheus, Loki, and Tempo data sources (see DD-10).
OpenTelemetry and the OpenTelemetry Collector — relevant as ecosystem and operator-side optional plumbing; this JEP intentionally does not adopt them in-process by default (DD-6).
Other HiL / test systems often separate “run metadata” (like Jenkins build id) from device state; similar separation maps well to this JEP’s lease context + events.
Unresolved Questions¶
Event retention: Loki retention policy (per-tenant, per-stream retention classes) for annotated log events (DD-2); whether Jumpstarter should document recommended retention defaults or leave this to operators.
Future Possibilities¶
SLOs and error budgets on lease acquisition time, flash success rate, and mean time to recovery of exporters.
Per-tenant or per-namespace dashboards as samples in the docs.
Not part of this JEP: billing usage metering (could reuse metrics later).
Implementation History¶
JEP-0013 proposed: 2026-04-23
JEP-0013 updated based on feedback: 2026-04-29
References¶
W3C Trace Context (`traceparent`)
Upstream project docs for the Prometheus, Loki, and Perses versions (and optional Tempo / Jaeger if used) in a given deployment; pin versions in release notes and integration tests.
This JEP is licensed under the Apache License, Version 2.0