Skip to content

p2p: fix dial metrics not picking up the right error #31621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 15, 2025

Conversation

cskiraly
Copy link
Contributor

@cskiraly cskiraly commented Apr 12, 2025

Our metrics related to dial errors were off. The original error was not wrapped, so the caller function had no chance of picking it up. Therefore the most common error, which is "TooManyPeers", was not correctly counted.

The metrics were originally introduced in #27621

I was thinking of various possible solutions.

  • the one proposed here wraps both the new error and the origial error. It is not a pattern we use in other parts of the code, but works. This is maybe the smallest possible change.
  • as an alternate, I could write a proper errProtoHandshakeError with it's own wrapped error
  • finally, I'm not even sure we need errProtoHandshakeError, maybe we could just pass up the original error.

The original error was not wrapped, so the caller function had
no chance of picking it up.

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
@cskiraly cskiraly requested a review from lightclient April 12, 2025 13:21
@cskiraly cskiraly marked this pull request as ready for review April 13, 2025 13:19
@cskiraly cskiraly requested review from fjl and zsfelfoldi as code owners April 13, 2025 13:20
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
@@ -67,23 +67,22 @@ func markDialError(err error) {
if !metrics.Enabled() {
return
}
if err2 := errors.Unwrap(err); err2 != nil {
Copy link
Contributor

@fjl fjl Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we really want here is

var reason DiscReason
if errors.As(err, &reason) {
    switch reason {
    ...
    }
}
var phe *protoHandshakeError
if errors.As(err, &phe) {
    dialProtoHandshakeError.Mark(1)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable, and also cleaner. I will verify how it works, and come back with a patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dialProtoHandshakeError would have to be in an else branch, and we would still need to handle dialEncHandshakeError separately.
I would say the original with Is and As is better. I've added a catch-all and some comments to make it even cleaner.

cskiraly and others added 5 commits April 15, 2025 02:21
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
We can't pass errProtoHandshakeError here, because the framework uses
`Is` for testing, and that would need `As`. We can however test how
it works for internal errors, which is the most typical case.

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
@fjl fjl added this to the 1.15.9 milestone Apr 15, 2025
@cskiraly
Copy link
Contributor Author

LGTM

@fjl fjl merged commit 6928ec5 into ethereum:master Apr 15, 2025
3 of 4 checks passed
sivaratrisrinivas pushed a commit to sivaratrisrinivas/go-ethereum that referenced this pull request Apr 21, 2025
Our metrics related to dial errors were off. The original error was not
wrapped, so the caller function had no chance of picking it up.
Therefore the most common error, which is "TooManyPeers", was not
correctly counted.

The metrics were originally introduced in
ethereum#27621

I was thinking of various possible solutions.
- the one proposed here wraps both the new error and the origial error.
It is not a pattern we use in other parts of the code, but works. This
is maybe the smallest possible change.
- as an alternate, I could write a proper `errProtoHandshakeError` with
it's own wrapped error
- finally, I'm not even sure we need `errProtoHandshakeError`, maybe we
could just pass up the original error.

---------

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
0g-wh pushed a commit to 0glabs/0g-geth that referenced this pull request Apr 22, 2025
Our metrics related to dial errors were off. The original error was not
wrapped, so the caller function had no chance of picking it up.
Therefore the most common error, which is "TooManyPeers", was not
correctly counted.

The metrics were originally introduced in
ethereum#27621

I was thinking of various possible solutions.
- the one proposed here wraps both the new error and the origial error.
It is not a pattern we use in other parts of the code, but works. This
is maybe the smallest possible change.
- as an alternate, I could write a proper `errProtoHandshakeError` with
it's own wrapped error
- finally, I'm not even sure we need `errProtoHandshakeError`, maybe we
could just pass up the original error.

---------

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
0g-wh pushed a commit to 0g-wh/0g-geth that referenced this pull request May 8, 2025
Our metrics related to dial errors were off. The original error was not
wrapped, so the caller function had no chance of picking it up.
Therefore the most common error, which is "TooManyPeers", was not
correctly counted.

The metrics were originally introduced in
ethereum#27621

I was thinking of various possible solutions.
- the one proposed here wraps both the new error and the origial error.
It is not a pattern we use in other parts of the code, but works. This
is maybe the smallest possible change.
- as an alternate, I could write a proper `errProtoHandshakeError` with
it's own wrapped error
- finally, I'm not even sure we need `errProtoHandshakeError`, maybe we
could just pass up the original error.

---------

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
Rampex1 pushed a commit to streamingfast/go-ethereum that referenced this pull request May 15, 2025
Our metrics related to dial errors were off. The original error was not
wrapped, so the caller function had no chance of picking it up.
Therefore the most common error, which is "TooManyPeers", was not
correctly counted.

The metrics were originally introduced in
ethereum#27621

I was thinking of various possible solutions.
- the one proposed here wraps both the new error and the origial error.
It is not a pattern we use in other parts of the code, but works. This
is maybe the smallest possible change.
- as an alternate, I could write a proper `errProtoHandshakeError` with
it's own wrapped error
- finally, I'm not even sure we need `errProtoHandshakeError`, maybe we
could just pass up the original error.

---------

Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants