
Issue crawling a web property with big PDFs #676

@benoit74

We (Kiwix) are struggling to crawl https://www.survivorlibrary.com/index.php/main-library-index/.

The problem is linked to big files.

We have already set up an include rule to capture only PDFs and exclude ZIP files, which are known to be too big, but we now realize there are also huge PDFs.

For instance, the problem occurs when crawling https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf, which is a ~837 MB file.

Logs:

{"timestamp":"2024-09-02T14:18:25.817Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf"}}
{"timestamp":"2024-09-02T14:18:25.817Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2332,"total":14591,"pending":1,"failed":2,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-09-02T14:18:25.673Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.survivorlibrary.com\\/library\\/corn_and_corn_improvement_1955.pdf\",\"added\":\"2024-09-02T12:33:36.057Z\",\"depth\":2}"]}}
{"timestamp":"2024-09-02T14:18:55.819Z","logLevel":"warn","context":"fetch","message":"Direct fetch capture attempt timed out","details":{"seconds":30,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:18:55.820Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:18:58.268Z","logLevel":"warn","context":"recorder","message":"Large streamed written to WARC, but not returned to browser, requires reading into memory","details":{"url":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","actualSize":893091154,"maxSize":5000000}}
{"timestamp":"2024-09-02T14:19:33.248Z","logLevel":"info","context":"writer","message":"Rollover size exceeded, creating new WARC","details":{"size":1258552021,"oldFilename":"rec-36c67859a2e1-20240902141737592-0.warc.gz","newFilename":"rec-36c67859a2e1-20240902141933248-0.warc.gz","rolloverSize":1000000000,"id":"0"}}
{"timestamp":"2024-09-02T14:20:13.553Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}
{"timestamp":"2024-09-02T14:20:13.554Z","logLevel":"warn","context":"recorder","message":"Failed to load response body","details":{"url":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","networkId":"7FFA1047E29E7008029AD4F1593A6C48","type":"exception","message":"Protocol error (Fetch.getResponseBody): Target closed","stack":"TargetCloseError: Protocol error (Fetch.getResponseBody): Target closed\n    at CallbackRegistry.clear (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69:36)\n    at CdpCDPSession._onClosed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:98:25)\n    at #onClose (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:163:21)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:43:30)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)\n    at WebSocket.onClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:220:9)\n    at WebSocket.emit (node:events:519:28)\n    at WebSocket.emitClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:272:10)\n    at Socket.socketOnClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:1341:15)\n    at Socket.emit (node:events:519:28)","page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:20:13.554Z","logLevel":"error","context":"general","message":"Page Load Failed, skipping page","details":{"msg":"Protocol error (Page.navigate): Target closed","loadState":0,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:20:13.609Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-09-02T14:20:13.754Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmpo992w83h/collections/crawl-20240902123019192/crawls/crawl-20240902142013-36c67859a2e1.yaml","details":{}}
{"timestamp":"2024-09-02T14:20:13.760Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2333,"total":14591,"pending":0,"failed":3,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-09-02T14:20:13.761Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-09-02T14:20:13.761Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}

It looks like the direct fetch times out, which is not a surprise since its timeout is 30s. The crawler then tries to download the file "normally" through the browser, but a Puppeteer disconnection occurs after some time.
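
For context, the behaviour in the logs matches a pattern like the sketch below (TypeScript, purely illustrative and not the crawler's actual code): the direct fetch is capped by a fixed 30-second timeout, and when that is exceeded the URL is handed over to the browser, which then has to hold the very large response.

```ts
// Illustrative sketch only; the constant, helper names and fallback logic are
// assumptions based on the log messages above, not browsertrix-crawler code.

const FETCH_TIMEOUT_SECS = 30;

async function directFetchCapture(url: string): Promise<void> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), FETCH_TIMEOUT_SECS * 1000);
  try {
    const resp = await fetch(url, { signal: controller.signal });
    // Streaming a ~837 MB body takes far longer than 30s, so the abort fires
    // mid-download and the capture attempt is reported as timed out.
    await resp.arrayBuffer();
  } finally {
    clearTimeout(timer);
  }
}

async function capture(url: string, loadInBrowser: (url: string) => Promise<void>): Promise<void> {
  try {
    await directFetchCapture(url);
  } catch (err) {
    console.warn("Direct fetch capture attempt timed out", { seconds: FETCH_TIMEOUT_SECS, url });
    // Fallback: let the browser navigate to the PDF. Per the log "Large streamed
    // written to WARC, but not returned to browser, requires reading into memory",
    // the huge response must be held in memory, which is where it seems to crash.
    await loadInBrowser(url);
  }
}
```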

Note that we have other big files where the direct fetch timed out but the "normal crawl" managed to proceed.

I cannot reproduce the problem on my machine, so it is probably linked to a memory issue or something else specific to the machine where the bug occurs.

For the record, here is the command I used to try to reproduce the problem (where the big PDF unfortunately downloads properly):

docker run -v $PWD/output:/output --name crawlme --rm  webrecorder/browsertrix-crawler:1.3.0-beta.0 crawl --url "https://www.survivorlibrary.com/index.php/Farming_Corn" --cwd /output --depth 1 --scopeType host

Even if one could argue that we should simply run this crawl on another machine, I wonder whether it would make more sense to allow customizing FETCH_TIMEOUT_SECS with a CLI flag, so that the direct fetch does not fail just because of a timeout when we know the download has good reasons to take a long time to complete. Would that have any adverse side effect (aside from the risk that the crawl takes longer to detect a direct fetch that is genuinely stuck)?
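
To make the suggestion concrete, here is a minimal sketch of what such an option could look like, assuming a yargs-style argument parser (which I believe the crawler already uses); the flag name --fetchTimeoutSecs and its default are hypothetical, not an existing crawler option:

```ts
// Hypothetical sketch: "--fetchTimeoutSecs" is not an existing flag; it only
// illustrates replacing the hard-coded FETCH_TIMEOUT_SECS with a CLI option.
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

const argv = yargs(hideBin(process.argv))
  .option("fetchTimeoutSecs", {
    type: "number",
    default: 30,
    describe:
      "Timeout (in seconds) for the direct fetch capture of non-HTML resources",
  })
  .parseSync();

// The direct fetch would then use the configured value instead of the constant.
const fetchTimeoutMs: number = argv.fetchTimeoutSecs * 1000;
console.log(`direct fetch timeout: ${fetchTimeoutMs} ms`);
```

The crawl above could then be started with something like --fetchTimeoutSecs 300 for sites known to serve very large PDFs.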
