Skip to content

Commit 86b494a

Browse files
matheus23Arqurklaehndivagant-martianflub
authored
feat: Collect metrics for direct connections & add opt-in push metrics (#2805)
## New metrics 1. `nodes_contacted`: Number of nodes we have attempted to contact. 2. `nodes_contacted_directly`: Number of nodes we have managed to contact directly. 3. `connection_handshake_success`: Number of connections with a successful handshake. 4. `connection_became_direct`: Number of connections that became direct. This sets up to compute some core metrics regarding direct connections: 1. `nodes_contacted_directly` / `nodes_contacted`: Ratio of node pairs that manage to connect directly. 2. `connection_became_direct` / `connection_handshake_success`: Ratio of connections that can communicate directly Both of these metrics give us an idea on the number of network conditions that allow us to avoid relay traffic. Our hypothesis is that this number is around 90% & that it's similar between these two metrics, but it's worth capturing both to see if we're right. We also already capture enough information to get numbers on direct vs. relayed traffic volume: `recv_data_relay` / (`recv_data_ipv4` + `recv_data_ipv6`): Ratio of relay data vs. direct data. This metric will help us explain how much bandwidth you can save by using iroh-net. ## ~~Opt-in Push Metrics Exporter~~ Initially this PR added some config to the iroh-cli to allow configuring a push metrics exporter. I removed this after discussions in the RFC. It's not the ideal way to capture metrics except for internal use. We still retain the changes in `iroh_metrics`, to make use of a push metrics exporter for internal use, though. ## Breaking Changes - `MagicsockMetrics` is now `#[non_exhaustive]`. This allows us to add more metrics without breaking backwards compatibility in the future. The struct is not meant to be constructed outside of `iroh-net` anyways. ## Change checklist - [x] Self-review. - [x] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [x] Tests if relevant. (We tested this manually at the retreat at least. We don't have metrics tests at the moment.) - [x] All breaking changes documented. --------- Co-authored-by: Asmir Avdicevic <asmir.avdicevic64@gmail.com> Co-authored-by: Ruediger Klaehn <rklaehn@protonmail.com> Co-authored-by: Diva M <divma@protonmail.com> Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com> Co-authored-by: Floris Bruynooghe <flub@n0.computer>
1 parent ed13453 commit 86b494a

File tree

6 files changed

+148
-8
lines changed

6 files changed

+148
-8
lines changed

iroh-metrics/src/lib.rs

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -76,3 +76,33 @@ pub fn parse_prometheus_metrics(data: &str) -> anyhow::Result<HashMap<String, f6
7676
}
7777
Ok(metrics)
7878
}
79+
80+
/// Configuration for pushing metrics to a remote endpoint.
81+
#[derive(PartialEq, Eq, Debug, Default, serde::Deserialize, Clone)]
82+
pub struct PushMetricsConfig {
83+
/// The push interval in seconds.
84+
pub interval: u64,
85+
/// The endpoint url for the push metrics collector.
86+
pub endpoint: String,
87+
/// The name of the service you're exporting metrics for.
88+
///
89+
/// Generally, `metrics_exporter` is good enough for use
90+
/// outside of production deployments.
91+
pub service_name: String,
92+
/// The name of the instance you're exporting metrics for.
93+
///
94+
/// This should be device-unique. If not, this will sum up
95+
/// metrics from different devices.
96+
///
97+
/// E.g. `username-laptop`, `username-phone`, etc.
98+
///
99+
/// Another potential scheme with good privacy would be a
100+
/// keyed blake3 hash of the secret key. (This gives you
101+
/// an identifier that is as unique as a `NodeID`, but
102+
/// can't be correlated to `NodeID`s.)
103+
pub instance_name: String,
104+
/// The username for basic auth for the push metrics collector.
105+
pub username: Option<String>,
106+
/// The password for basic auth for the push metrics collector.
107+
pub password: String,
108+
}

iroh-metrics/src/metrics.rs

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -62,3 +62,17 @@ pub async fn start_metrics_dumper(
6262
) -> anyhow::Result<()> {
6363
crate::service::dumper(&path, interval).await
6464
}
65+
66+
/// Start a metrics exporter service.
67+
#[cfg(feature = "metrics")]
68+
pub async fn start_metrics_exporter(cfg: crate::PushMetricsConfig) {
69+
crate::service::exporter(
70+
cfg.endpoint,
71+
cfg.service_name,
72+
cfg.instance_name,
73+
cfg.username,
74+
cfg.password,
75+
std::time::Duration::from_secs(cfg.interval),
76+
)
77+
.await;
78+
}

iroh-metrics/src/service.rs

Lines changed: 51 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ use std::{
77
use anyhow::{anyhow, Context, Result};
88
use hyper::{service::service_fn, Request, Response};
99
use tokio::{io::AsyncWriteExt as _, net::TcpListener};
10-
use tracing::{error, info};
10+
use tracing::{debug, error, info, warn};
1111

1212
use crate::{core::Core, parse_prometheus_metrics};
1313

@@ -116,3 +116,53 @@ async fn dump_metrics(
116116
}
117117
Ok(())
118118
}
119+
120+
/// Export metrics to a push gateway.
121+
pub async fn exporter(
122+
gateway_endpoint: String,
123+
service_name: String,
124+
instance_name: String,
125+
username: Option<String>,
126+
password: String,
127+
interval: Duration,
128+
) {
129+
let Some(core) = Core::get() else {
130+
error!("metrics disabled");
131+
return;
132+
};
133+
let push_client = reqwest::Client::new();
134+
let prom_gateway_uri = format!(
135+
"{}/metrics/job/{}/instance/{}",
136+
gateway_endpoint, service_name, instance_name
137+
);
138+
loop {
139+
tokio::time::sleep(interval).await;
140+
let buff = core.encode();
141+
match buff {
142+
Err(e) => error!("Failed to encode metrics: {e:#}"),
143+
Ok(buff) => {
144+
let mut req = push_client.post(&prom_gateway_uri);
145+
if let Some(username) = username.clone() {
146+
req = req.basic_auth(username, Some(password.clone()));
147+
}
148+
let res = match req.body(buff).send().await {
149+
Ok(res) => res,
150+
Err(e) => {
151+
warn!("failed to push metrics: {}", e);
152+
continue;
153+
}
154+
};
155+
match res.status() {
156+
reqwest::StatusCode::OK => {
157+
debug!("pushed metrics to gateway");
158+
}
159+
_ => {
160+
warn!("failed to push metrics to gateway: {:?}", res);
161+
let body = res.text().await.unwrap();
162+
warn!("error body: {}", body);
163+
}
164+
}
165+
}
166+
}
167+
}
168+
}

iroh-net/src/endpoint/rtt_actor.rs

Lines changed: 18 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,14 +5,18 @@ use std::collections::HashMap;
55
use futures_concurrency::stream::stream_group;
66
use futures_lite::StreamExt;
77
use iroh_base::key::NodeId;
8+
use iroh_metrics::inc;
89
use tokio::{
910
sync::{mpsc, Notify},
1011
task::JoinHandle,
1112
time::Duration,
1213
};
1314
use tracing::{debug, error, info_span, trace, warn, Instrument};
1415

15-
use crate::magicsock::{ConnectionType, ConnectionTypeStream};
16+
use crate::{
17+
magicsock::{ConnectionType, ConnectionTypeStream},
18+
metrics::MagicsockMetrics,
19+
};
1620

1721
#[derive(Debug)]
1822
pub(super) struct RttHandle {
@@ -65,7 +69,9 @@ struct RttActor {
6569
///
6670
/// These are weak references so not to keep the connections alive. The key allows
6771
/// removing the corresponding stream from `conn_type_changes`.
68-
connections: HashMap<stream_group::Key, (quinn::WeakConnectionHandle, NodeId)>,
72+
/// The boolean is an indiciator of whether this connection was direct before.
73+
/// This helps establish metrics on number of connections that became direct.
74+
connections: HashMap<stream_group::Key, (quinn::WeakConnectionHandle, NodeId, bool)>,
6975
/// A way to notify the main actor loop to run over.
7076
///
7177
/// E.g. when a new stream was added.
@@ -113,8 +119,9 @@ impl RttActor {
113119
node_id: NodeId,
114120
) {
115121
let key = self.connection_events.insert(conn_type_changes);
116-
self.connections.insert(key, (connection, node_id));
122+
self.connections.insert(key, (connection, node_id, false));
117123
self.tick.notify_one();
124+
inc!(MagicsockMetrics, connection_handshake_success);
118125
}
119126

120127
/// Performs the congestion controller reset for a magic socket path change.
@@ -125,14 +132,19 @@ impl RttActor {
125132
/// happens commonly.
126133
fn do_reset_rtt(&mut self, item: Option<(stream_group::Key, ConnectionType)>) {
127134
match item {
128-
Some((key, new_conn_type)) => match self.connections.get(&key) {
129-
Some((handle, node_id)) => {
135+
Some((key, new_conn_type)) => match self.connections.get_mut(&key) {
136+
Some((handle, node_id, was_direct_before)) => {
130137
if handle.reset_congestion_state() {
131138
debug!(
132139
node_id = %node_id.fmt_short(),
133140
new_type = ?new_conn_type,
134141
"Congestion controller state reset",
135142
);
143+
if !*was_direct_before && matches!(new_conn_type, ConnectionType::Direct(_))
144+
{
145+
*was_direct_before = true;
146+
inc!(MagicsockMetrics, connection_became_direct);
147+
}
136148
} else {
137149
debug!(
138150
node_id = %node_id.fmt_short(),
@@ -151,7 +163,7 @@ impl RttActor {
151163

152164
/// Performs cleanup for closed connection.
153165
fn do_connections_cleanup(&mut self) {
154-
for (key, (handle, node_id)) in self.connections.iter() {
166+
for (key, (handle, node_id, _)) in self.connections.iter() {
155167
if !handle.is_alive() {
156168
trace!(node_id = %node_id.fmt_short(), "removing stale connection");
157169
self.connection_events.remove(*key);

iroh-net/src/magicsock/metrics.rs

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ use iroh_metrics::{
66
/// Enum of metrics for the module
77
#[allow(missing_docs)]
88
#[derive(Debug, Clone, Iterable)]
9+
#[non_exhaustive]
910
pub struct Metrics {
1011
pub re_stun_calls: Counter,
1112
pub update_direct_addrs: Counter,
@@ -68,6 +69,16 @@ pub struct Metrics {
6869
pub actor_tick_direct_addr_update_receiver: Counter,
6970
pub actor_link_change: Counter,
7071
pub actor_tick_other: Counter,
72+
73+
/// Number of nodes we have attempted to contact.
74+
pub nodes_contacted: Counter,
75+
/// Number of nodes we have managed to contact directly.
76+
pub nodes_contacted_directly: Counter,
77+
78+
/// Number of connections with a successful handshake.
79+
pub connection_handshake_success: Counter,
80+
/// Number of connections with a successful handshake that became direct.
81+
pub connection_became_direct: Counter,
7182
}
7283

7384
impl Default for Metrics {
@@ -132,6 +143,12 @@ impl Default for Metrics {
132143
),
133144
actor_link_change: Counter::new("actor_link_change"),
134145
actor_tick_other: Counter::new("actor_tick_other"),
146+
147+
nodes_contacted: Counter::new("nodes_contacted"),
148+
nodes_contacted_directly: Counter::new("nodes_contacted_directly"),
149+
150+
connection_handshake_success: Counter::new("connection_handshake_success"),
151+
connection_became_direct: Counter::new("connection_became_direct"),
135152
}
136153
}
137154
}

iroh-net/src/magicsock/node_map/node_state.rs

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -134,6 +134,10 @@ pub(super) struct NodeState {
134134
last_call_me_maybe: Option<Instant>,
135135
/// The type of connection we have to the node, either direct, relay, mixed, or none.
136136
conn_type: Watchable<ConnectionType>,
137+
/// Whether the conn_type was ever observed to be `Direct` at some point.
138+
///
139+
/// Used for metric reporting.
140+
has_been_direct: bool,
137141
}
138142

139143
/// Options for creating a new [`NodeState`].
@@ -173,6 +177,7 @@ impl NodeState {
173177
last_used: options.active.then(Instant::now),
174178
last_call_me_maybe: None,
175179
conn_type: Watchable::new(ConnectionType::None),
180+
has_been_direct: false,
176181
}
177182
}
178183

@@ -304,6 +309,10 @@ impl NodeState {
304309
(None, Some(relay_url)) => ConnectionType::Relay(relay_url),
305310
(None, None) => ConnectionType::None,
306311
};
312+
if !self.has_been_direct && matches!(&typ, ConnectionType::Direct(_)) {
313+
self.has_been_direct = true;
314+
inc!(MagicsockMetrics, nodes_contacted_directly);
315+
}
307316
if let Ok(prev_typ) = self.conn_type.update(typ.clone()) {
308317
// The connection type has changed.
309318
event!(
@@ -1132,7 +1141,11 @@ impl NodeState {
11321141
have_ipv6: bool,
11331142
) -> (Option<SocketAddr>, Option<RelayUrl>, Vec<PingAction>) {
11341143
let now = Instant::now();
1135-
self.last_used.replace(now);
1144+
let prev = self.last_used.replace(now);
1145+
if prev.is_none() {
1146+
// this is the first time we are trying to connect to this node
1147+
inc!(MagicsockMetrics, nodes_contacted);
1148+
}
11361149
let (udp_addr, relay_url) = self.addr_for_send(&now, have_ipv6);
11371150
let mut ping_msgs = Vec::new();
11381151

@@ -1485,6 +1498,7 @@ mod tests {
14851498
last_used: Some(now),
14861499
last_call_me_maybe: None,
14871500
conn_type: Watchable::new(ConnectionType::Direct(ip_port.into())),
1501+
has_been_direct: true,
14881502
},
14891503
ip_port.into(),
14901504
)
@@ -1504,6 +1518,7 @@ mod tests {
15041518
last_used: Some(now),
15051519
last_call_me_maybe: None,
15061520
conn_type: Watchable::new(ConnectionType::Relay(send_addr.clone())),
1521+
has_been_direct: false,
15071522
}
15081523
};
15091524

@@ -1530,6 +1545,7 @@ mod tests {
15301545
last_used: Some(now),
15311546
last_call_me_maybe: None,
15321547
conn_type: Watchable::new(ConnectionType::Relay(send_addr.clone())),
1548+
has_been_direct: false,
15331549
}
15341550
};
15351551

@@ -1569,6 +1585,7 @@ mod tests {
15691585
socket_addr,
15701586
send_addr.clone(),
15711587
)),
1588+
has_been_direct: false,
15721589
},
15731590
socket_addr,
15741591
)

0 commit comments

Comments
 (0)