More succbone metrics #14

Merged
q3k merged 1 commit from metrics into main 2024-11-09 22:20:38 +00:00
Owner

As discussed. I am unsure about the second commit - it seems to me that uptime via the counter

```
increase(sem_rp_operating_seconds_total[10y])
```

and uptime via totalizing the status metric

```
sum_over_time(sem_rp_running[10y])
```

are equivalent.

If I understand correctly, the counter would only be beneficial if we tracked its total value across `succd` restarts on the succbone, because then totalizing the value would not depend on the data retention of Prometheus. Adding that is of course non-trivial if we want it to be robust.

What are your thoughts?

rahix added the succd label 2024-10-06 09:38:54 +00:00
rahix added 2 commits 2024-10-06 09:38:54 +00:00
For more detailed monitoring, let's export all process values that are
exposed to the web API as prometheus metrics.
succd: Add metrics for RP and RP operating time
Some checks failed
/ test (push) Failing after 3s
/ test (pull_request) Failing after 2s
2cfea46852
Add counter metrics that count the total operating time for the roughing
pump and the diffusion pump.
rahix force-pushed metrics from 2cfea46852 to a16dc6b73e 2024-10-06 09:40:24 +00:00 Compare
rahix requested review from q3k 2024-10-06 09:41:23 +00:00
q3k requested changes 2024-10-06 15:08:59 +00:00
Dismissed
@@ -171,2 +179,3 @@
state := s.d.snapshot()
mbar := state.piraniMbar100.mbar
// sem_pressure_mbar is meant to represent the fused pressure value
Owner

Let's wait until we have more data sources and work on the model then - I don't think there's much sense in doing this ahead of time?

Author
Owner

My main motivation here was to not break the existing metric but keep consistency in the naming - a `sem_pressure_volts` would not make sense, but having `sem_pressure_mbar` alongside `sem_pirani_volts` is also not nice. So by duplicating the metric, the names are consistent and we don't break already recorded metrics data for no good reason.

@@ -176,0 +195,4 @@
fmt.Fprintf(w, "# HELP sem_pirani_failsafe Whether pirani gauge failsafe mode is active (boolean)\n")
fmt.Fprintf(w, "# TYPE sem_pirani_failsafe gauge\n")
fmt.Fprintf(w, "sem_pirani_failsafe %f\n", boolToFloat(state.safety.failsafe))
Owner

Probably should be suffixed with `_enabled` or `_active` (same for DP lockout).

rahix marked this conversation as resolved
@@ -176,0 +199,4 @@
fmt.Fprintf(w, "# HELP sem_dp_lockout Whether diffusion pump lockout is active (boolean)\n")
fmt.Fprintf(w, "# TYPE sem_dp_lockout gauge\n")
fmt.Fprintf(w, "sem_dp_lockout %f\n", boolToFloat(state.safety.highPressure))
Owner

If we rename this to `sem_pump_diffusion_...` and `sem_pump_roughing_...`, we could then have aggregate metrics on e.g. the regex `sem_pump_.*_running`.

rahix marked this conversation as resolved
@@ -176,0 +210,4 @@
fmt.Fprintf(w, "sem_rp_running %f\n", boolToFloat(state.rpOn))
rough, high := state.vacuumStatus()
fmt.Fprintf(w, "# HELP sem_vacuum_rough_reached Whether the rough vacuum threshold has been passed (boolean)\n")
Owner

`passed` in comment, `reached` in metric name - let's settle on one term?

rahix marked this conversation as resolved
@@ -190,5 +197,13 @@ func (d *daemon) processOnce(_ context.Context) error {
}
}
// Update operating time counters
Owner

Instead of tying this into the processing loop elapsed calculation (which I meant as just an implementation detail), I would express machine run counters this way:

```
type runCounter struct {
    started time.Time
    running bool
}

// process returns how long the machine has been running in its current run
// (zero while it is off), tracking the off-to-on transition.
func (r *runCounter) process(running bool) time.Duration {
    if !running {
        r.running = false
        return 0
    }
    now := time.Now()
    if !r.running {
        // Machine just switched on: remember when this run started.
        r.started = now
        r.running = true
    }
    return now.Sub(r.started)
}
```

and then

```
// (process takes the pump's current on/off state; the field name is assumed)
d.rpOperatingTime = d.rpOperationCounter.process(d.rpOn)
```
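(For reference, a minimal sketch of how the accumulated duration could then be surfaced as the counter metric, in the same hand-written style as the other metrics in this PR; the `rpOperatingTime` snapshot field is an assumption, not something taken from the diff.)

```
// Hypothetical export of the accumulated RP operating time as a counter,
// mirroring the fmt.Fprintf style used for the other metrics in this PR.
fmt.Fprintf(w, "# HELP sem_rp_operating_seconds_total Total roughing pump operating time (seconds)\n")
fmt.Fprintf(w, "# TYPE sem_rp_operating_seconds_total counter\n")
fmt.Fprintf(w, "sem_rp_operating_seconds_total %f\n", state.rpOperatingTime.Seconds())
```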
Author
Owner

As mentioned in the PR description, the counters only make sense to keep if we change them to represent the total operating time, with persistence on the succbone itself. Without that, the counters are superfluous because the same information can be extracted from e.g. `sem_pump_roughing_running`.

I wanted to hear your opinion on this first before either dropping the commit or implementing a persistent operating time counter. Essentially the decision hinges on:

- How important an operating hours counter is to us
- How much data retention is configured in Prometheus - is it enough to get meaningful operating hours info
Owner

Ah, sorry, I didn't quite understand.

A counter will just work even if you don't persist the value - that's the point of a counter vs. a gauge. I.e., if the reported value resets to zero, Prometheus will assume a restart of the process and keep counting from the previous value. In other words, it's a way to express a monotonically increasing global value with non-monotonically increasing distributed counters.

I would prefer using a counter with `increase` here instead of relying on `sum_over_time`, as that's likely significantly more performant (I expect `sum_over_time` to have to scan through all recorded data within the window, while `increase` should only need to check the first and last datapoint of the window - effectively doing the running-tally preprocessing ahead of time vs. at query time).

Author
Owner

So that's exactly what I thought, and why I started adding the counter implementation in the first place.

But it seems this isn't actually correct - Prometheus only does the "account for resets" handling as part of evaluating `increase()` (which is just `rate()` under the hood). It doesn't actually persist the "absolute count" in any way. So to get an accurate total, you'd need to have Prometheus keep *all* historical data, and as far as I can tell, `increase()` will also need to scan that entire range then. Which is just as bad as the `sum_over_time()`...

So at this point, to me, neither option looks great. The only "performant" and accurate solution is keeping the total value on the succbone and only passing that along as a metric. And mainly I am wondering whether that is worth the effort...
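To make the trade-off concrete, a minimal sketch (not part of this PR) of what persisting the total on the succbone might look like; the state-file path and type names are hypothetical, and the robustness concerns noted in the comments (atomic replacement, write cadence, corruption handling) are exactly the non-trivial part:

```
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// operatingTotal keeps a machine's total operating time and persists it
// across succd restarts. Path and names are hypothetical, not from this PR.
type operatingTotal struct {
    path  string
    total time.Duration
    last  time.Time
}

// load restores the previously persisted total (stored as seconds), if any.
func (o *operatingTotal) load() error {
    b, err := os.ReadFile(o.path)
    if os.IsNotExist(err) {
        return nil // first run, start from zero
    }
    if err != nil {
        return err
    }
    secs, err := strconv.ParseFloat(strings.TrimSpace(string(b)), 64)
    if err != nil {
        return err
    }
    o.total = time.Duration(secs * float64(time.Second))
    return nil
}

// tick adds the wall-clock time elapsed since the previous tick while the
// machine is running, then writes the total back out. A robust version would
// write to a temporary file and rename it into place, and rate-limit writes.
func (o *operatingTotal) tick(running bool) error {
    now := time.Now()
    if running && !o.last.IsZero() {
        o.total += now.Sub(o.last)
    }
    o.last = now
    return os.WriteFile(o.path, []byte(fmt.Sprintf("%f\n", o.total.Seconds())), 0o644)
}

func main() {
    rp := &operatingTotal{path: "/var/lib/succd/rp_operating_seconds"} // hypothetical path
    if err := rp.load(); err != nil {
        fmt.Fprintln(os.Stderr, "load:", err)
    }
    // succd's processing loop would call this with the actual pump state;
    // here we just simulate a single tick.
    if err := rp.tick(true); err != nil {
        fmt.Fprintln(os.Stderr, "persist:", err)
    }
    fmt.Printf("sem_rp_operating_seconds_total %f\n", rp.total.Seconds())
}
```

With something like this in place, `increase(sem_rp_operating_seconds_total[10y])` would no longer depend on Prometheus retention, since the exported value itself is the lifetime total.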

rahix force-pushed metrics from a16dc6b73e to c031fa5a43 2024-10-07 05:42:59 +00:00 Compare
Owner

Apart from the counter you seem to be wary about, is there anything stopping this PR from being merged? @hugo and I wasted some time today expanding the metrics (#15 :P) because we didn't check open PRs, of course.

Author
Owner

@q3k, ping?

hugo force-pushed metrics from c031fa5a43 to 0aba323779 2024-11-09 22:03:51 +00:00 Compare
Owner

LGTM

q3k approved these changes 2024-11-09 22:09:03 +00:00
hugo scheduled this pull request to auto merge when all checks succeed 2024-11-09 22:09:46 +00:00
Owner

(please wait while I fix CI)

q3k scheduled this pull request to auto merge when all checks succeed 2024-11-09 22:19:00 +00:00
q3k merged commit 0aba323779 into main 2024-11-09 22:20:38 +00:00
Reference: fafo/jeol-t330a#14