Skip to content

Commit 9b88b4e

Browse files
Remove mutex lock around record_progress endpoint
See comment added in source for why this is OK. We are trying to reduce the runtime of this endpoint, as the longer it takes the fewer results we can receive. Obviously, the agents submitting results are the true bound on performance, but if the server storing results is near capacity (as we currently are, on a single thread), then we may be artificially reducing our throughput. Certainly, the endpoint performance is a bound on the *possible* highest level of throughput with an infinitely fast agent pool.
1 parent 191ea00 commit 9b88b4e

File tree

1 file changed

+12
-8
lines changed

1 file changed

+12
-8
lines changed

src/server/routes/agent.rs

Lines changed: 12 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,7 @@ pub fn routes(
5959
.and(warp::path("record-progress"))
6060
.and(warp::path::end())
6161
.and(warp::body::json())
62-
.and(mutex_filter.clone())
62+
.and(data_filter.clone())
6363
.and(auth_filter(data.clone(), TokenType::Agent))
6464
.map(endpoint_record_progress);
6565

@@ -154,14 +154,22 @@ fn endpoint_next_experiment(
154154
Ok(ApiResponse::Success { result }.into_response()?)
155155
}
156156

157+
// This endpoint does not use the mutex data wrapper to exclude running in
158+
// parallel with other endpoints, which may mean that we (for example) are
159+
// recording results for an abort'd experiment. This should generally be fine --
160+
// the database already has foreign_keys enabled and that should ensure
161+
// appropriate synchronization elsewhere. (If it doesn't, that's mostly a bug
162+
// elsewhere, not here).
163+
//
164+
// In practice it's pretty likely that we won't fully run in parallel anyway,
165+
// but this lets some of the work proceed without the lock being held, which is
166+
// generally positive.
157167
fn endpoint_record_progress(
158168
result: ExperimentData<ProgressData>,
159-
mutex: Arc<Mutex<Data>>,
169+
data: Arc<Data>,
160170
auth: AuthDetails,
161171
) -> Fallible<Response<Body>> {
162172
let start = std::time::Instant::now();
163-
let data = mutex.lock().unwrap();
164-
let lock_wait = start.elapsed();
165173
let ex = Experiment::get(&data.db, &result.experiment_name)?
166174
.ok_or_else(|| err_msg("no experiment run by this agent"))?;
167175

@@ -176,10 +184,6 @@ fn endpoint_record_progress(
176184
.crater_endpoint_time
177185
.with_label_values(&["record_progress"])
178186
.observe(start.elapsed().as_secs_f64());
179-
data.metrics
180-
.crater_endpoint_time
181-
.with_label_values(&["record_progress_lock"])
182-
.observe(lock_wait.as_secs_f64());
183187
ret
184188
}
185189

0 commit comments

Comments
 (0)