Metrics (OpenTelemetry)
Salt can emit OpenTelemetry metrics — counters, histograms and observable gauges — for the operational signals that operators most often care about: job throughput, return latency, minion connectivity, worker queue depth, file-descriptor pressure, and returner egress health.
Metrics complement the distributed tracing story. Traces answer "what happened during one job?"; metrics answer "what's happening across the fleet right now?".
The instrumentation is disabled by default and is a complete no-op when not configured. No exporter is initialised, no background threads are started, no Prometheus listener is bound, and no payload changes land on the wire.
Configuration
Add a metrics block to the master and minion configs. Settings
are the same on both daemons.
metrics:
  enabled: true
  exporter: otlp-http            # otlp-http | otlp-grpc | prometheus | console
  endpoint: ""                   # OTLP collector URL (empty = SDK default)
  service_name: ""               # empty = auto-derived from process role
  resource_attributes: {}        # extra OTel Resource attributes
  insecure: true                 # gRPC TLS off (ignored for non-grpc)
  headers: {}                    # OTLP auth headers
  export_interval_seconds: 60    # PeriodicExportingMetricReader interval
  prometheus:
    host: 127.0.0.1              # localhost-bind by default
    port: 9464
  histogram_boundaries:
    salt.job.duration: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000]
    salt.minion.exec.duration: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]
    salt.master.requests.duration: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]
enabled
    Master switch. false (the default) means everything in this module is a no-op.
exporter
    - otlp-http (default): push OTLP protobuf over HTTP. Pure-Python; ships in Salt's base requirements.
    - otlp-grpc: push OTLP over gRPC. Requires opentelemetry-exporter-otlp-proto-grpc to be installed separately (it pulls in grpcio, which lacks prebuilt wheels for some platform / interpreter combinations).
    - prometheus: bind a local /metrics HTTP endpoint that Prometheus can scrape. Operators who already run Prometheus can skip the OTel Collector entirely.
    - console: print metrics to stdout for debugging.
endpoint
    OTLP collector URL when exporter is otlp-http or otlp-grpc. When empty, the OTel SDK default is used (http://localhost:4318/v1/metrics for HTTP, http://localhost:4317 for gRPC).
service_name
    The service.name resource attribute. When empty, Salt fills this in automatically: salt-master, salt-minion-<id>, salt-cli, salt-call, salt-api.
export_interval_seconds
    How often the periodic exporter flushes to the collector. Ignored for the Prometheus pull exporter (Prometheus controls cadence via its scrape interval).
prometheus.host / prometheus.port
    Where the Prometheus pull listener binds. Defaults to localhost-only. In a multi-process master only the parent binds this port; counters incremented inside MWorker children are not visible through the parent's /metrics (use the OTLP push exporter if you need worker-side counters in a Prometheus deployment with multiple workers).
histogram_boundaries
    Per-instrument explicit bucket boundaries, in milliseconds. The default boundaries run from 1 ms to one minute for salt.job.duration and from 1 ms to ten seconds for salt.minion.exec.duration and salt.master.requests.duration.
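The defaulting rules for endpoint and service_name can be summarised in a few lines. This is a minimal sketch, not Salt's implementation: resolve_endpoint and resolve_service_name are hypothetical helper names, and the role strings passed in are assumptions made for illustration. In practice the OTel SDK applies its own default (and may consult environment variables); the table below only mirrors the URLs quoted above.

```python
# Hypothetical helpers illustrating the documented defaults; not Salt's actual code.
OTLP_DEFAULT_ENDPOINTS = {
    "otlp-http": "http://localhost:4318/v1/metrics",
    "otlp-grpc": "http://localhost:4317",
}


def resolve_endpoint(metrics_cfg):
    """Return the OTLP URL to push to, or None for pull/console exporters."""
    exporter = metrics_cfg.get("exporter", "otlp-http")
    if exporter not in OTLP_DEFAULT_ENDPOINTS:
        return None  # prometheus and console do not push to a collector
    # An empty or missing endpoint falls back to the documented default URL.
    return metrics_cfg.get("endpoint") or OTLP_DEFAULT_ENDPOINTS[exporter]


def resolve_service_name(metrics_cfg, role, minion_id=None):
    """Return the configured service.name, or derive one from the process role."""
    configured = metrics_cfg.get("service_name")
    if configured:
        return configured
    if role == "minion" and minion_id:
        return f"salt-minion-{minion_id}"
    return f"salt-{role}"  # role is assumed to be master, cli, call or api
```

For example, resolve_service_name({}, "minion", "web01") yields salt-minion-web01, matching the salt-minion-<id> pattern listed above.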
Instrument inventory
Counters
- salt.jobs.published{fun}: jobs published from master to minions.
- salt.jobs.completed{fun,success}: returns received from minions.
- salt.auth.attempts{result}: master auth attempts; result is one of success, invalid_id, max_minions, rejected, error.
- salt.master.requests.handled{cmd}: every request dispatched by the master worker (clear-funcs + aes-funcs), labelled by the salt cmd name (publish, _auth, _return, _serve_file, mine_get, …). This is the OTel mirror of the per-command runs counter that master_stats exposes via the event bus.
- salt.events.fired{tag_prefix}: events placed on the event bus, labelled by the first non-salt segment of the tag.
- salt.returners.calls{returner,status}: minion-side returner invocations; status is ok, missing, or error.
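To make the tag_prefix label concrete, here is a minimal sketch of one way to derive "the first non-salt segment" from an event tag. The function name tag_prefix and the "unknown" fallback are assumptions for illustration; the real derivation in Salt may differ in its edge cases.

```python
def tag_prefix(tag):
    """Return the first '/'-separated segment of an event tag that is not 'salt'."""
    for segment in tag.split("/"):
        if segment and segment != "salt":
            return segment
    return "unknown"  # assumption: fallback label for empty or 'salt'-only tags


# A job return tag collapses to a single bounded label value.
print(tag_prefix("salt/job/20240101120000000000/ret/web01"))  # job
print(tag_prefix("salt/auth"))                                # auth
```

Because the jid and minion id segments are discarded, the label stays within a small set of values however many jobs or minions exist.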
Histograms
- salt.job.duration{fun} (ms): CLI-to-master-return wall-clock per minion return. Recorded by LocalClient.get_iter_returns on each return event.
- salt.minion.exec.duration{fun} (ms): minion-side wall-clock for a single function execution (the same window the salt.minion.exec.<fun> trace span covers).
- salt.master.requests.duration{cmd} (ms): per-command master worker dispatcher latency, recorded in MWorker._handle_clear and MWorker._handle_aes. Together with the matching salt.master.requests.handled counter this gives feature parity with the legacy master_stats per-command runs + mean surface, but live in OTel instead of fired as periodic events.
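The explicit bucket boundaries configured under histogram_boundaries decide which bucket each recorded duration lands in. The sketch below shows the bucketing rule used by OTel explicit-bucket histograms (exclusive lower bound, inclusive upper bound) with the default salt.job.duration boundaries. The helper names are hypothetical, and the code only illustrates the arithmetic; it is not how the SDK stores its data.

```python
from bisect import bisect_left

# Default salt.job.duration boundaries from the configuration above, in ms.
JOB_DURATION_BOUNDARIES_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500,
                              5000, 10000, 30000, 60000]


def bucket_index(value_ms, boundaries):
    """Index of the bucket a value lands in; upper bounds are inclusive."""
    return bisect_left(boundaries, value_ms)


def bucket_counts(values_ms, boundaries):
    """Per-bucket counts: len(boundaries) + 1 buckets, the last is overflow."""
    counts = [0] * (len(boundaries) + 1)
    for value in values_ms:
        counts[bucket_index(value, boundaries)] += 1
    return counts
```

With these boundaries, a 0.4 ms return falls in bucket 0 (everything up to 1 ms), a 3 ms return in bucket 1, and anything slower than 60000 ms in the final overflow bucket, so jobs that routinely exceed a minute need wider boundaries to be resolved.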
Observable gauges
- salt.master.connected_minions.count: sourced from salt.utils.minions.CkMinions.connected_ids(). Registered only in the master parent process to avoid worker over-count.
- salt.master.workers.queue.depth{pool}: MWorker payloads in flight, observed via a shared multiprocessing.Value that every worker increments on _handle_payload entry and decrements on exit.
- salt.process.open_fds ({fd}): current file-descriptor count from psutil.Process().num_fds(). Registered separately in the master parent and in the minion process. Returns no observations on Windows, where num_fds is unavailable.
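The queue-depth gauge relies on a counter shared across processes. The following is a minimal sketch of that increment-on-entry, decrement-on-exit pattern using multiprocessing.Value. The names track_payload and observe_queue_depth are hypothetical, the pool label is omitted, and the sketch runs in a single process; in a real master the value would be created before the workers are forked.

```python
import multiprocessing
from contextlib import contextmanager

# Shared signed int; a real master would create this before forking workers.
_in_flight = multiprocessing.Value("i", 0)


@contextmanager
def track_payload(counter=_in_flight):
    """Count a payload as in flight for the duration of the with-block."""
    with counter.get_lock():
        counter.value += 1
    try:
        yield
    finally:
        # Decrement even if the handler raises, so the gauge cannot drift upward.
        with counter.get_lock():
            counter.value -= 1


def observe_queue_depth(counter=_in_flight):
    """What a gauge callback in the parent process would read."""
    return counter.value
```

Wrapping the handler body in track_payload() keeps the increment and decrement paired, which matters because a missed decrement would leave the gauge permanently inflated.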
Label cardinality
Every label above has a bounded domain: fun (the Salt module
namespace), cmd (the master's command names), result and status
(small enums), returner (the configured returner names), pool (the
configured worker pool names), and tag_prefix (a small set of event
tag namespaces). No instrument uses minion_id, jid, or user as a
label; unbounded identifiers like these belong on trace spans as
attributes, not on metrics as labels. Operators adding their own
instruments should follow the same discipline.
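Each distinct combination of label values becomes its own time series, which is why unbounded identifiers are kept off metrics. One way to enforce this in custom instrumentation is to split attributes before recording. This is a sketch under stated assumptions: split_attributes and the deny-list are hypothetical and not part of Salt.

```python
# Assumption: an illustrative deny-list; the three keys are the ones named above.
UNBOUNDED_KEYS = frozenset({"minion_id", "jid", "user"})


def split_attributes(attributes):
    """Split attributes into (metric_labels, span_only) by cardinality."""
    labels = {k: v for k, v in attributes.items() if k not in UNBOUNDED_KEYS}
    span_only = {k: v for k, v in attributes.items() if k in UNBOUNDED_KEYS}
    return labels, span_only


labels, span_only = split_attributes(
    {"fun": "test.ping", "jid": "20240101120000000000", "minion_id": "web01"}
)
print(labels)     # {'fun': 'test.ping'}
print(span_only)  # {'jid': '20240101120000000000', 'minion_id': 'web01'}
```

The first dictionary is safe to pass as metric labels; the second would be attached to the corresponding trace span instead.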
Running a quick demo
OTLP / OpenTelemetry Collector:
docker run -d --name otelcol \
    -p 4318:4318 \
    otel/opentelemetry-collector-contrib
Configure master + minion:
metrics:
  enabled: true
  exporter: otlp-http
  endpoint: http://localhost:4318/v1/metrics
  export_interval_seconds: 10
Start them, run salt '*' test.ping a few times, and watch the collector
logs for salt.jobs.published, salt.job.duration and friends.
Prometheus pull:
metrics:
  enabled: true
  exporter: prometheus
  prometheus:
    host: 127.0.0.1
    port: 9464
curl -s http://127.0.0.1:9464/metrics | grep '^salt_' then shows
the salt-namespaced metrics.
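The grep pattern uses salt_ rather than salt. because classic Prometheus metric names may only contain letters, digits, underscores and colons, so the dots in OTel instrument names are replaced on export. The sketch below shows only that character substitution; prometheus_name is a hypothetical helper, and a real exporter may additionally append unit or _total suffixes depending on its version, so it is safer to match on the prefix than on a full name.

```python
import re

# Characters outside the classic Prometheus metric-name alphabet.
_INVALID = re.compile(r"[^a-zA-Z0-9_:]")


def prometheus_name(otel_name):
    """Replace characters Prometheus' classic naming rules reject with '_'."""
    name = _INVALID.sub("_", otel_name)
    if name and name[0].isdigit():
        name = "_" + name  # names must not start with a digit
    return name


print(prometheus_name("salt.jobs.published"))  # salt_jobs_published
```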
Fork handling
Like the tracing SDK, the OTel PeriodicExportingMetricReader
background thread does not survive fork(). Salt rebuilds the
provider in every forked child the first time a metrics API is invoked,
so master workers and minion executor processes each get their own
functioning reader without any caller action.
Observable gauges are registered exactly once — in the master parent for master-side gauges, in the minion process for minion-side gauges — to avoid forked-worker over-counting.
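A common way to get this "rebuild on first use after fork" behaviour is to remember which PID built the provider and compare it on every access. The class below is a minimal sketch of that pattern, not Salt's code: ForkAwareProvider, its factory argument and the injectable getpid are all hypothetical names chosen for illustration.

```python
import os


class ForkAwareProvider:
    """Lazily builds a provider and rebuilds it when the PID changes."""

    def __init__(self, factory, getpid=os.getpid):
        self._factory = factory      # callable that builds a fresh provider
        self._getpid = getpid        # injectable so the behaviour can be tested
        self._provider = None
        self._owner_pid = None

    def get(self):
        pid = self._getpid()
        if self._provider is None or self._owner_pid != pid:
            # First use in this process (or first use after fork): build anew,
            # because the parent's exporter thread was not copied by fork().
            self._provider = self._factory()
            self._owner_pid = pid
        return self._provider
```

Every metrics call would go through get(), so a forked child pays the rebuild cost once and afterwards reuses its own provider without the caller doing anything.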
Payload and CPU overhead
Metric increments are zero-allocation when metrics are disabled (every
public function short-circuits before touching the OTel SDK). When
enabled, counter and histogram operations are sub-microsecond. The
PeriodicExportingMetricReader background thread wakes on
export_interval_seconds (default 60s). The Prometheus pull listener
binds a single local port and serves a few KiB of text per scrape.
No metric instrumentation changes the on-the-wire format of any salt request, event, or return — they are purely local to each daemon's process.
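The disabled-path behaviour described above amounts to an early return at the top of every public function. The following sketch shows the shape of that guard with a hypothetical Metrics facade; a plain dictionary stands in for the real SDK instruments, and the sketch makes no claim about the allocation behaviour of Salt's actual functions.

```python
class Metrics:
    """Hypothetical facade: every public method returns early when disabled."""

    def __init__(self, enabled=False):
        self.enabled = enabled
        self._counters = {}  # stands in for real SDK instruments

    def inc(self, name, amount=1, **labels):
        if not self.enabled:
            return  # disabled: return before any instrument is touched
        key = (name, tuple(sorted(labels.items())))
        self._counters[key] = self._counters.get(key, 0) + amount


disabled = Metrics()
disabled.inc("salt.jobs.published", fun="test.ping")
print(disabled._counters)  # {}
```

Call sites can therefore stay unconditional: the same inc() call is harmless whether or not metrics are configured.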