consensus missed round metric#119
Conversation
|
@Brindrajsinh-Chauhan thanks for contribution, could you fix the rust formatting/lint |
Signed-off-by: Brindrajsinh-Chauhan <brindrajsinh@gmail.com>
bcaccc9 to
66b9904
Compare
|
@ZhiyuCircle Would be grateful to know the next step to get this in or will this be first merged in internal repos and then reverse synced to upstream |
| #[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, EncodeLabelSet)] | ||
| struct RoundMissedLabel { | ||
| proposer: AsLabelValue<Address>, | ||
| height: AsLabelValue<u64>, | ||
| round: AsLabelValue<i64>, | ||
| } |
There was a problem hiding this comment.
Given that height grows forever and round can spike during outages, this creates one time series per missed height/round/proposer for the process lifetime, which can grow unbounded very quickly if eg. a validator is down for an extended period of time.
We have actually seen such an issue on testnet with another metric which was recording outbound IP address with their ephemeral ports, causing problems when ingesting those metrics into Prometheus/Datadog.
I am not sure what the best solution here is, since I agree that having the height and round is useful information.
Curious to hear what you think!
In the meantime, I'll see if I can find a good solution for this.
There was a problem hiding this comment.
Thank you for the feedback, I will also think on this to find a solution for this
There was a problem hiding this comment.
I think at the moment, I should drop the height and round from the metric and just have missed proposer since that can be a good alerting metrics if there is a increase in missed round by a proposer. The warn log you suggested below could then be used to investigate the height and rounds the proposer missed. So alerting like
increase(consensus_rounds_missed_total{proposer="0xABC"}[5m]) > 0
could work. Would be grateful to get your thoughts on this. I know removing the labels is not the best possible options but gets us the important thing of being able alert / detect issues causing the proposer to miss multiple rounds
| if round.as_i64() > 0 { | ||
| let current_round = round.as_u32().expect("round is defined"); | ||
| let missed_round = Round::new(current_round.saturating_sub(1)); | ||
| let missed_proposer = state | ||
| .ctx | ||
| .proposer_selector | ||
| .select_proposer(state.validator_set(), height, missed_round) | ||
| .address; | ||
| state | ||
| .metrics() | ||
| .inc_consensus_round_missed(missed_proposer, height, missed_round); | ||
| } |
There was a problem hiding this comment.
This undercounts missed rounds if consensus jumps from round 0 to round N via Tendermint skip-round handling.
| if round.as_i64() > 0 { | |
| let current_round = round.as_u32().expect("round is defined"); | |
| let missed_round = Round::new(current_round.saturating_sub(1)); | |
| let missed_proposer = state | |
| .ctx | |
| .proposer_selector | |
| .select_proposer(state.validator_set(), height, missed_round) | |
| .address; | |
| state | |
| .metrics() | |
| .inc_consensus_round_missed(missed_proposer, height, missed_round); | |
| } | |
| let mut missed_round = state.current_round.or(Round::ZERO); | |
| while missed_round < round { | |
| let missed_proposer = state | |
| .ctx | |
| .proposer_selector | |
| .select_proposer(state.validator_set(), height, missed_round) | |
| .address; | |
| warn!(%height, %missed_round, %missed_proposer, "Consensus round missed"); | |
| state | |
| .metrics() | |
| .inc_consensus_round_missed(missed_proposer, height, missed_round); | |
| missed_round = missed_round.increment(); | |
| } |
@Brindrajsinh-Chauhan after second pass review, we have few concerns like @romac mention above |
Signed-off-by: Brindrajsinh-Chauhan <brindrajsinh@gmail.com>
This adds a metrics to measure the missed rounds with labels including the proposer height and the round missed. This is helpful to create alerting rules for missed rounds and identify validators that might be having issues.