More succbone metrics #14

Merged
q3k merged 1 commit from metrics into main 2024-11-09 22:20:38 +00:00
Owner

As discussed. I am unsure about the second commit - it seems to me that uptime via the counter

```
increase(sem_rp_operating_seconds_total[10y])
```

and uptime via totalizing the status metric

```
sum_over_time(sem_rp_running[10y])
```

are equivalent.

If I understand correctly, the counter would only be beneficial if we tracked its total value across `succd` restarts on the succbone, because then totalizing the value would not depend on the data retention of Prometheus. Adding that is of course non-trivial if we want it to be robust.

What are your thoughts?

rahix added the succd label 2024-10-06 09:38:54 +00:00
rahix added 2 commits 2024-10-06 09:38:54 +00:00
For more detailed monitoring, let's export all process values that are
exposed to the web API as prometheus metrics.
succd: Add metrics for RP and RP operating time
Some checks failed
/ test (push) Failing after 3s
/ test (pull_request) Failing after 2s
2cfea46852
Add counter metrics that count the total operating time for the roughing
pump and the diffusion pump.
rahix force-pushed metrics from 2cfea46852 to a16dc6b73e 2024-10-06 09:40:24 +00:00 Compare
rahix requested review from q3k 2024-10-06 09:41:23 +00:00
q3k requested changes 2024-10-06 15:08:59 +00:00
Dismissed
@@ -171,2 +179,3 @@
state := s.d.snapshot()
mbar := state.piraniMbar100.mbar
// sem_pressure_mbar is meant to represent the fused pressure value
Owner

Let's wait until we have more data sources and work on the model then - I don't think there's much sense in doing this ahead of time?

Author
Owner

My main motivation here was to not break the existing metric but keep consistency in the naming - a `sem_pressure_volts` would not make sense, but having `sem_pressure_mbar` alongside `sem_pirani_volts` is also not nice. So by duplicating the metric, the names are consistent and we don't break already recorded metrics data for no good reason.

@@ -176,0 +195,4 @@
fmt.Fprintf(w, "# HELP sem_pirani_failsafe Whether pirani gauge failsafe mode is active (boolean)\n")
fmt.Fprintf(w, "# TYPE sem_pirani_failsafe gauge\n")
fmt.Fprintf(w, "sem_pirani_failsafe %f\n", boolToFloat(state.safety.failsafe))
Owner

Probably should be suffixed with `_enabled` or `_active` (same for DP lockout).

rahix marked this conversation as resolved
@@ -176,0 +199,4 @@
fmt.Fprintf(w, "# HELP sem_dp_lockout Whether diffusion pump lockout is active (boolean)\n")
fmt.Fprintf(w, "# TYPE sem_dp_lockout gauge\n")
fmt.Fprintf(w, "sem_dp_lockout %f\n", boolToFloat(state.safety.highPressure))
Owner

If we rename this to `sem_pump_diffusion_...` and `sem_pump_roughing_...`, we could then have aggregate metrics on e.g. the regex `sem_pump_.*_running`.

rahix marked this conversation as resolved
@@ -176,0 +210,4 @@
fmt.Fprintf(w, "sem_rp_running %f\n", boolToFloat(state.rpOn))
rough, high := state.vacuumStatus()
fmt.Fprintf(w, "# HELP sem_vacuum_rough_reached Whether the rough vacuum threshold has been passed (boolean)\n")
Owner

`passed` in comment, `reached` in metric name - let's settle on one term?

rahix marked this conversation as resolved
@@ -190,5 +197,13 @@ func (d *daemon) processOnce(_ context.Context) error {
}
}
// Update operating time counters
Owner

Instead of tying this into the processing loop elapsed calculation (which I meant as just an implementation detail), I would express machine run counters this way:

```
type runCounter struct {
    started time.Time
    running bool
}

// process returns how long the machine has been running in its current run
// (zero while it is off), tracking the off-to-on transition.
func (r *runCounter) process(running bool) time.Duration {
    if !running {
        r.running = false
        return 0
    }
    now := time.Now()
    if !r.running {
        // Machine just switched on: remember when this run started.
        r.started = now
        r.running = true
    }
    return now.Sub(r.started)
}
```

and then

```
// (process takes the pump's current on/off state; the field name is assumed)
d.rpOperatingTime = d.rpOperationCounter.process(d.rpOn)
```
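(For reference, a minimal sketch of how the accumulated duration could then be surfaced as the counter metric, in the same hand-written style as the other metrics in this PR; the `rpOperatingTime` snapshot field is an assumption, not something taken from the diff.)

```
// Hypothetical export of the accumulated RP operating time as a counter,
// mirroring the fmt.Fprintf style used for the other metrics in this PR.
fmt.Fprintf(w, "# HELP sem_rp_operating_seconds_total Total roughing pump operating time (seconds)\n")
fmt.Fprintf(w, "# TYPE sem_rp_operating_seconds_total counter\n")
fmt.Fprintf(w, "sem_rp_operating_seconds_total %f\n", state.rpOperatingTime.Seconds())
```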
Author
Owner

As mentioned in the PR description, the counters only make sense to keep if we change them to represent the total operating time, with persistence on the succbone itself. Without that, the counters are superfluous because the same information can be extracted from e.g. `sem_pump_roughing_running`.

I wanted to hear your opinion on this first before either dropping the commit or implementing a persistent operating time counter. Essentially the decision hinges on:

- How important an operating hours counter is to us
- How much data retention is configured in Prometheus - is it enough to get meaningful operating hours info
Owner

Ah, sorry, I didn't quite understand.

A counter will just work even if you don't persist the value - that's the point of a counter vs. a gauge. I.e., if the reported value resets to zero, Prometheus will assume a restart of the process and keep counting from the previous value. In other words, it's a way to express a monotonically increasing global value with non-monotonically increasing distributed counters.

I would prefer using a counter with `increase` here instead of relying on `sum_over_time`, as that's likely significantly more performant (I expect `sum_over_time` to have to scan through all recorded data within the window, while `increase` should only need to check the first and last datapoint of the window - effectively doing the running-tally preprocessing ahead of time vs. at query time).

Author
Owner

So that's exactly what I thought, and why I started adding the counter implementation in the first place.

But it seems this isn't actually correct - Prometheus only does the "account for resets" handling as part of evaluating `increase()` (which is just `rate()` under the hood). It doesn't actually persist the "absolute count" in any way. So to get an accurate total, you'd need to have Prometheus keep *all* historical data, and as far as I can tell, `increase()` will also need to scan that entire range then. Which is just as bad as the `sum_over_time()`...

So at this point, to me, neither option looks great. The only "performant" and accurate solution is keeping the total value on the succbone and only passing that along as a metric. And mainly I am wondering whether that is worth the effort...
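To make the trade-off concrete, a minimal sketch (not part of this PR) of what persisting the total on the succbone might look like; the state-file path and type names are hypothetical, and the robustness concerns noted in the comments (atomic replacement, write cadence, corruption handling) are exactly the non-trivial part:

```
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// operatingTotal keeps a machine's total operating time and persists it
// across succd restarts. Path and names are hypothetical, not from this PR.
type operatingTotal struct {
    path  string
    total time.Duration
    last  time.Time
}

// load restores the previously persisted total (stored as seconds), if any.
func (o *operatingTotal) load() error {
    b, err := os.ReadFile(o.path)
    if os.IsNotExist(err) {
        return nil // first run, start from zero
    }
    if err != nil {
        return err
    }
    secs, err := strconv.ParseFloat(strings.TrimSpace(string(b)), 64)
    if err != nil {
        return err
    }
    o.total = time.Duration(secs * float64(time.Second))
    return nil
}

// tick adds the wall-clock time elapsed since the previous tick while the
// machine is running, then writes the total back out. A robust version would
// write to a temporary file and rename it into place, and rate-limit writes.
func (o *operatingTotal) tick(running bool) error {
    now := time.Now()
    if running && !o.last.IsZero() {
        o.total += now.Sub(o.last)
    }
    o.last = now
    return os.WriteFile(o.path, []byte(fmt.Sprintf("%f\n", o.total.Seconds())), 0o644)
}

func main() {
    rp := &operatingTotal{path: "/var/lib/succd/rp_operating_seconds"} // hypothetical path
    if err := rp.load(); err != nil {
        fmt.Fprintln(os.Stderr, "load:", err)
    }
    // succd's processing loop would call this with the actual pump state;
    // here we just simulate a single tick.
    if err := rp.tick(true); err != nil {
        fmt.Fprintln(os.Stderr, "persist:", err)
    }
    fmt.Printf("sem_rp_operating_seconds_total %f\n", rp.total.Seconds())
}
```

With something like this in place, `increase(sem_rp_operating_seconds_total[10y])` would no longer depend on Prometheus retention, since the exported value itself is the lifetime total.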

rahix force-pushed metrics from a16dc6b73e to c031fa5a43 2024-10-07 05:42:59 +00:00 Compare
Owner

Apart from the counter you seem to be wary about, is there anything stopping this PR from being merged? @hugo and I wasted some time today expanding the metrics (#15 :P) because we didn't check open PRs, of course.

Author
Owner

@q3k, ping?

hugo force-pushed metrics from c031fa5a43 to 0aba323779 2024-11-09 22:03:51 +00:00 Compare
Owner

LGTM

q3k approved these changes 2024-11-09 22:09:03 +00:00
hugo scheduled this pull request to auto merge when all checks succeed 2024-11-09 22:09:46 +00:00
Owner

(please wait while I fix CI)

q3k scheduled this pull request to auto merge when all checks succeed 2024-11-09 22:19:00 +00:00
q3k merged commit 0aba323779 into main 2024-11-09 22:20:38 +00:00
Reference: fafo/jeol-t330a#14