Skip to content

Commit 0f0ff9b

Browse files
rix0rrrgithub-actions
and
github-actions
authored
fix(npmjs): follower hangs when doing a backfill (#1721)
For reasons of the NPMJS replica behaving strangely[1], our follower has changed to starting a full backfill of the entire catalog. However, when doing the full backfill we're seeing strange behavior: when doing fetches for a lot of packages in parallel, we notice that the follower stops printing anything and the Lambda just times out after 5 minutes of inactivity. Looking at the logs, 47 requests are sent out in parallel but only 15 are answered, and then nothing more happens. Looking at the code it is a mess of NodeJS event handling, and from what I can tell there is no handling of socket error events that occur in the request (just socket error events that occur in the response). Best theory: since these events are dropped on the floor, the corresponding Promise is never resolved or rejected and the follower will just hang. There also was no request timeout, I added that as well. I refactored the code to be more straightforward: - No recursion to do retries but a `while` loop - Use a deadline-driven timeout (previous code would retry indefinitely) - Put the classification of what makes an error retryable in once place - Switch to using `await new Promise(...)` instead of the callback style. - Importantly: add `req.on('error')` and `req.on('timeout')` to handle problems with the request. - Replace `x = new Error(); Error.captureStackTrace(x); ko(x)` with `throw new Error()`, since you can just do that inside the `new Promise()` constructor function. [1]: https://github.com/orgs/community/discussions/152515#discussioncomment-13053455 ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license* --------- Signed-off-by: github-actions <github-actions@github.com> Co-authored-by: github-actions <github-actions@github.com>
1 parent d4fa755 commit 0f0ff9b

File tree

3 files changed

+101
-93
lines changed

3 files changed

+101
-93
lines changed

src/__tests__/__snapshots__/construct-hub.test.ts.snap

Lines changed: 5 additions & 5 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

src/__tests__/devapp/__snapshots__/snapshot.test.ts.snap

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

src/package-sources/npmjs/couch-changes.lambda-shared.ts

Lines changed: 95 additions & 87 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,17 @@
11
import { EventEmitter } from 'events';
22
import { OutgoingHttpHeaders } from 'http';
3-
import { Agent, request } from 'https';
3+
import { Agent, request, RequestOptions } from 'https';
44
import { Readable } from 'stream';
55
import { URL } from 'url';
66
import { createGunzip } from 'zlib';
77
import * as JSONStream from 'JSONStream';
88

99
const NPM_REGISTRY_URL = 'https://registry.npmjs.org/';
1010

11+
const REQUEST_DEADLINE_MS = 30_000;
12+
13+
const REQUEST_ATTEMPT_TIMEOUT_MS = 5_000;
14+
1115
/**
1216
* A utility class that helps with traversing CouchDB database changes streams
1317
* in a promise-based, page-by-page manner.
@@ -28,6 +32,9 @@ export class CouchChanges extends EventEmitter {
2832
keepAlive: true,
2933
keepAliveMsecs: 5_000,
3034
maxSockets: 4,
35+
36+
// This timeout is separate from the request timeout, and is here to
37+
// prevent stalled/idle connections
3138
timeout: 60_000,
3239
});
3340
this.baseUrl = new URL(baseUrl);
@@ -120,104 +127,105 @@ export class CouchChanges extends EventEmitter {
120127
*
121128
* @returns the JSON-decoded response body.
122129
*/
123-
private https(
124-
method: string,
130+
private async https(
131+
method: 'get' | 'post',
125132
url: URL,
126-
body?: { [key: string]: unknown },
127-
attempt = 1
133+
body?: { [key: string]: unknown }
128134
): Promise<{ [key: string]: unknown }> {
129-
return new Promise((ok, ko) => {
130-
const retry = () =>
131-
setTimeout(() => {
132-
console.log(`Retrying ${method.toUpperCase()} ${url}`);
133-
this.https(method, url, body, attempt + 1).then(ok, ko);
134-
}, Math.min(500 * attempt, 5_000));
135-
136-
const headers: OutgoingHttpHeaders = {
137-
Accept: 'application/json',
138-
'Accept-Encoding': 'gzip',
139-
'npm-replication-opt-in': 'true', // can be deleted after May 29: https://github.com/orgs/community/discussions/152515
140-
};
141-
if (body) {
142-
headers['Content-Type'] = 'application/json';
143-
}
135+
const headers: OutgoingHttpHeaders = {
136+
Accept: 'application/json',
137+
'Accept-Encoding': 'gzip',
138+
'npm-replication-opt-in': 'true', // can be deleted after May 29: https://github.com/orgs/community/discussions/152515
139+
};
140+
if (body) {
141+
headers['Content-Type'] = 'application/json';
142+
}
144143

145-
const req = request(
146-
url,
147-
{
148-
agent: this.agent,
149-
headers,
150-
method,
151-
port: 443,
152-
servername: url.hostname,
153-
},
154-
(res) => {
155-
if (res.statusCode == null) {
156-
const error = new Error(
157-
`[FATAL] Request failed: ${method.toUpperCase()} ${url}`
158-
);
159-
Error.captureStackTrace(error);
160-
return ko(error);
161-
}
162-
163-
console.log(
164-
`Response: ${method.toUpperCase()} ${url} => HTTP ${
165-
res.statusCode
166-
} (${res.statusMessage})`
167-
);
168-
169-
// Transient (server) errors:
170-
if (res.statusCode >= 500 && res.statusCode < 600) {
171-
console.error(
172-
`[RETRYABLE] HTTP ${res.statusCode} (${
173-
res.statusMessage
174-
}) - ${method.toUpperCase()} ${url}`
175-
);
176-
// Call again after a short back-off
177-
return retry();
178-
}
179-
// Permanent (client) errors:
180-
if (res.statusCode >= 400 && res.statusCode < 500) {
181-
const error = new Error(
182-
`[FATAL] HTTP ${res.statusCode} (${
183-
res.statusMessage
184-
}) - ${method.toUpperCase()} ${url}`
185-
);
186-
Error.captureStackTrace(error);
187-
return ko(error);
188-
}
189-
190-
const onError = (err: Error & { code?: string }) => {
191-
if (err.code === 'ECONNRESET') {
192-
// Transient networking problem?
193-
console.error(
194-
`[RETRYABLE] ${err.code} - ${method.toUpperCase()} ${url}`
144+
const requestOptions: RequestOptions = {
145+
agent: this.agent,
146+
headers,
147+
method,
148+
port: 443,
149+
servername: url.hostname,
150+
// This just leads to a 'timeout' event
151+
timeout: REQUEST_ATTEMPT_TIMEOUT_MS,
152+
};
153+
154+
const deadline = Date.now() + REQUEST_DEADLINE_MS;
155+
let maxDelay = 100;
156+
while (true) {
157+
try {
158+
return await new Promise((ok, ko) => {
159+
const req = request(url, requestOptions, (res) => {
160+
if (res.statusCode == null) {
161+
throw new RetryableError('No status code available');
162+
}
163+
164+
// Server errors. We can't know whether these are really retryable but we usually pretend that they are.
165+
if (res.statusCode >= 500 && res.statusCode < 600) {
166+
throw new RetryableError(
167+
`HTTP ${res.statusCode} ${res.statusMessage}`
195168
);
196-
retry();
197-
} else {
198-
Error.captureStackTrace(err);
199-
console.log('[NON-RETRYABLE]', err);
200-
ko(err);
201169
}
202-
};
203170

204-
res.once('error', onError);
171+
// Permanent (client) errors:
172+
if (res.statusCode >= 400 && res.statusCode < 500) {
173+
throw new Error(`HTTP ${res.statusCode} ${res.statusMessage}`);
174+
}
175+
176+
console.log(
177+
`Response: ${method.toUpperCase()} ${url} => HTTP ${
178+
res.statusCode
179+
} (${res.statusMessage})`
180+
);
181+
182+
res.once('error', ko);
183+
184+
const json = JSONStream.parse(true);
185+
json.once('data', ok);
186+
json.once('error', ko);
187+
188+
const plainPayload =
189+
res.headers['content-encoding'] === 'gzip' ? gunzip(res) : res;
190+
plainPayload.pipe(json, { end: true });
191+
plainPayload.once('error', ko);
192+
});
205193

206-
const json = JSONStream.parse(true);
207-
json.once('data', ok);
208-
json.once('error', onError);
194+
req.on('error', ko);
195+
req.on('timeout', () => {
196+
req.destroy(
197+
new RetryableError(
198+
`Timeout after ${REQUEST_ATTEMPT_TIMEOUT_MS}ms, aborting request`
199+
)
200+
);
201+
});
209202

210-
const plainPayload =
211-
res.headers['content-encoding'] === 'gzip' ? gunzip(res) : res;
212-
plainPayload.pipe(json, { end: true });
213-
plainPayload.once('error', onError);
203+
req.end(body && JSON.stringify(body, null, 2));
204+
});
205+
} catch (e: any) {
206+
if (Date.now() > deadline || !isRetryableError(e)) {
207+
throw e;
214208
}
215-
);
216-
req.end(body && JSON.stringify(body, null, 2));
217-
});
209+
210+
console.error(`[RETRYABLE] ${method} ${url}: ${e}`);
211+
212+
await sleep(Math.floor(Math.random() * maxDelay));
213+
maxDelay *= 2;
214+
}
215+
}
218216
}
219217
}
220218

219+
class RetryableError extends Error {}
220+
221+
function isRetryableError(e: Error): boolean {
222+
return e instanceof RetryableError || (e as any).code === 'ECONNRESET';
223+
}
224+
225+
async function sleep(ms: number) {
226+
return new Promise((ok) => setTimeout(ok, ms));
227+
}
228+
221229
export interface DatabaseChanges {
222230
/**
223231
* The last sequence ID from this change set. This is the value that should be

0 commit comments

Comments
 (0)