More succbone metrics #14
As discussed. I am unsure about the second commit - it seems to me that uptime via the counter and uptime via totalizing the status metric are equivalent.

If I understand correctly, the counter would only be beneficial if we tracked its total value across succd restarts on the succbone, because then totalizing the value doesn't depend on the data retention of prometheus. Adding that is of course non-trivial if we want it to be robust. What are your thoughts?
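To make the comparison concrete, a sketch of the two representations side by side (the counter metric name and the rpOperatingSeconds field are hypothetical; the gauge line mirrors the existing export):

// Status gauge: 1.0 while the roughing pump runs, 0.0 otherwise.
// Its run time is totalized at query time, e.g.
//   sum_over_time(sem_rp_running[30d]) * <scrape interval in seconds>
fmt.Fprintf(w, "sem_rp_running %f\n", boolToFloat(state.rpOn))
// Counter: succd accumulates the operating seconds itself; the total
// over a window is then queried with increase(...[30d]).
fmt.Fprintf(w, "sem_rp_operating_seconds_total %f\n", state.rpOperatingSeconds)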
2cfea46852 to a16dc6b73e
@ -171,2 +179,3 @@
state := s.d.snapshot()
mbar := state.piraniMbar100.mbar
// sem_pressure_mbar is meant to represent the fused pressure value
Let's wait until we have more data sources and work on the model then - I don't think there's much sense in doing this ahead of time?
My main motivation here was to not break the existing metric but keep consistency in the naming - a sem_pressure_volts would not make sense, but having sem_pressure_mbar alongside sem_pirani_volts is also not nice. So by duplicating the metric, the names are consistent and we don't break already recorded metrics data for no good reason.
@ -176,0 +195,4 @@
fmt.Fprintf(w, "# HELP sem_pirani_failsafe Whether pirani gauge failsafe mode is active (boolean)\n")
fmt.Fprintf(w, "# TYPE sem_pirani_failsafe gauge\n")
fmt.Fprintf(w, "sem_pirani_failsafe %f\n", boolToFloat(state.safety.failsafe))
Probably should be suffixed with _enabled or _active (same for DP lockout).
@ -176,0 +199,4 @@
fmt.Fprintf(w, "# HELP sem_dp_lockout Whether diffusion pump lockout is active (boolean)\n")
fmt.Fprintf(w, "# TYPE sem_dp_lockout gauge\n")
fmt.Fprintf(w, "sem_dp_lockout %f\n", boolToFloat(state.safety.highPressure))
If we rename this to sem_pump_diffusion_... and sem_pump_roughing_... we could then have aggregate metrics on e.g. the regex sem_pump_.*_running.
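For illustration, the renamed exports could mirror the existing lines (hypothetical names; state.dpOn is assumed to exist alongside state.rpOn):

fmt.Fprintf(w, "sem_pump_diffusion_running %f\n", boolToFloat(state.dpOn))
fmt.Fprintf(w, "sem_pump_roughing_running %f\n", boolToFloat(state.rpOn))

A query like sum({__name__=~"sem_pump_.*_running"}) would then aggregate across all pumps without listing each metric by hand.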
@ -176,0 +210,4 @@
fmt.Fprintf(w, "sem_rp_running %f\n", boolToFloat(state.rpOn))
rough, high := state.vacuumStatus()
fmt.Fprintf(w, "# HELP sem_vacuum_rough_reached Whether the rough vacuum threshold has been passed (boolean)\n")
passed in the comment, reached in the metric name - let's settle on one term?
@ -190,5 +197,13 @@ func (d *daemon) processOnce(_ context.Context) error {
}
}
// Update operating time counters
Instead of tying this into the processing loop elapsed calculation (which I meant as just an implementation detail), I would express machine run counters this way:
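For illustration only, one possible shape with hypothetical names (rpOnSince, rpOperatingSeconds), tracking pump state transitions rather than per-iteration elapsed time:

// On a transition to running, remember when the pump turned on; on a
// transition to stopped, fold the completed run into the total.
if state.rpOn && state.rpOnSince.IsZero() {
	state.rpOnSince = time.Now()
}
if !state.rpOn && !state.rpOnSince.IsZero() {
	state.rpOperatingSeconds += time.Since(state.rpOnSince).Seconds()
	state.rpOnSince = time.Time{}
}

with the exporter adding the currently running stretch on top:

total := state.rpOperatingSeconds
if !state.rpOnSince.IsZero() {
	total += time.Since(state.rpOnSince).Seconds()
}
fmt.Fprintf(w, "# TYPE sem_rp_operating_seconds_total counter\n")
fmt.Fprintf(w, "sem_rp_operating_seconds_total %f\n", total)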
As mentioned in the PR description, the counters only make sense to keep if we change them to represent the total operating time with persistence on the succbone itself. Without that, the counters are superfluous because the same information can be extracted from e.g. sem_pump_roughing_running.

I wanted to hear your opinion on this first before either dropping the commit or implementing a persistent operating time counter. Essentially the decision hinges on whether that persistence is worth the effort.
Ah, sorry, didn't exactly understand.
A counter will just work even if you don't persist the value - that's the point of a counter vs. a gauge. I.e., if the reported value resets to zero, prometheus will assume a restart of the process and keep counting from the previous value. In other words, it's a way to express a monotonically increasing global value with non-monotonically increasing distributed counters.
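A toy sketch of that reset handling (not the actual prometheus code, just the idea):

// Reset-aware "increase" over counter samples; the exporting process
// restarts after the third sample, so the value drops from 150 to 10.
package main

import "fmt"

func main() {
	samples := []float64{100, 120, 150, 10, 40}
	var inc float64
	for i := 1; i < len(samples); i++ {
		d := samples[i] - samples[i-1]
		if d < 0 {
			// Counter reset detected: the counter restarted from zero,
			// so the whole new value counts towards the increase.
			d = samples[i]
		}
		inc += d
	}
	fmt.Println(inc) // 90 = 50 before the restart + 40 after it
}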
I would prefer using a counter with increase here instead of relying on sum_over_time, as that's likely significantly more performant (I expect sum_over_time to have to scan through all recorded data within the window, while increase should only need to check the first and last datapoint of the window - effectively doing the running tally preprocessing ahead of time vs. at query time).

So that's exactly what I thought and why I started adding the counter implementation in the first place.
But it seems this isn't actually correct - prometheus only does the "account for resets" as part of evaluating increase() (which is just rate() under the hood). It doesn't actually persist the "absolute count" in any way. So to get an accurate total, you'd need to have prometheus keep all historical data, and as far as I can tell, increase() will then also need to scan that entire range, which is just as bad as the sum_over_time()...

So at this point, to me, neither option looks great. The only "performant" and accurate solution is keeping the total value on the succbone and only passing that along as a metric. And mainly I am wondering whether that is worth the effort...
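For what it's worth, a rough sketch of that succbone-side persistence (hypothetical helpers and file handling; a robust version would also need periodic flushing and some care around partially written state):

// Uses os, strconv and strings from the standard library.
func loadOperatingSeconds(path string) float64 {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0 // no persisted state yet (or unreadable): start at zero
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(string(b)), 64)
	if err != nil {
		return 0
	}
	return v
}

func saveOperatingSeconds(path string, v float64) error {
	// Write to a temporary file and rename, so a crash mid-write does
	// not destroy the previous total.
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(strconv.FormatFloat(v, 'f', 3, 64)), 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

succd would load this total at startup, add run time to it as sketched above, and export it directly as the counter value.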
a16dc6b73e to c031fa5a43
Apart from the counter you seem to be wary about, is there anything stopping this PR from being merged? @hugo and I wasted some time today expanding the metrics (#15 :P) because we didn't check open PRs, of course.
@q3k, ping?
c031fa5a43 to 0aba323779
LGTM
(please wait while I fix CI)