
Issue crawling a web property with big PDFs #676

@benoit74

We (Kiwix) are struggling to crawl https://www.survivorlibrary.com/index.php/main-library-index/.

The problem is linked to big files.

We have already set up an include rule to capture only PDFs and exclude ZIP files, which are known to be too big, but we now realize there are also huge PDFs.

For instance, the problem occurs when crawling https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf, which is a ~837 MB file.

Logs:

{"timestamp":"2024-09-02T14:18:25.817Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf"}}
{"timestamp":"2024-09-02T14:18:25.817Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2332,"total":14591,"pending":1,"failed":2,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-09-02T14:18:25.673Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.survivorlibrary.com\\/library\\/corn_and_corn_improvement_1955.pdf\",\"added\":\"2024-09-02T12:33:36.057Z\",\"depth\":2}"]}}
{"timestamp":"2024-09-02T14:18:55.819Z","logLevel":"warn","context":"fetch","message":"Direct fetch capture attempt timed out","details":{"seconds":30,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:18:55.820Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:18:58.268Z","logLevel":"warn","context":"recorder","message":"Large streamed written to WARC, but not returned to browser, requires reading into memory","details":{"url":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","actualSize":893091154,"maxSize":5000000}}
{"timestamp":"2024-09-02T14:19:33.248Z","logLevel":"info","context":"writer","message":"Rollover size exceeded, creating new WARC","details":{"size":1258552021,"oldFilename":"rec-36c67859a2e1-20240902141737592-0.warc.gz","newFilename":"rec-36c67859a2e1-20240902141933248-0.warc.gz","rolloverSize":1000000000,"id":"0"}}
{"timestamp":"2024-09-02T14:20:13.553Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}
{"timestamp":"2024-09-02T14:20:13.554Z","logLevel":"warn","context":"recorder","message":"Failed to load response body","details":{"url":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","networkId":"7FFA1047E29E7008029AD4F1593A6C48","type":"exception","message":"Protocol error (Fetch.getResponseBody): Target closed","stack":"TargetCloseError: Protocol error (Fetch.getResponseBody): Target closed\n    at CallbackRegistry.clear (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69:36)\n    at CdpCDPSession._onClosed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:98:25)\n    at #onClose (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:163:21)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:43:30)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)\n    at WebSocket.onClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:220:9)\n    at WebSocket.emit (node:events:519:28)\n    at WebSocket.emitClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:272:10)\n    at Socket.socketOnClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:1341:15)\n    at Socket.emit (node:events:519:28)","page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:20:13.554Z","logLevel":"error","context":"general","message":"Page Load Failed, skipping page","details":{"msg":"Protocol error (Page.navigate): Target closed","loadState":0,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:20:13.609Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-09-02T14:20:13.754Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmpo992w83h/collections/crawl-20240902123019192/crawls/crawl-20240902142013-36c67859a2e1.yaml","details":{}}
{"timestamp":"2024-09-02T14:20:13.760Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2333,"total":14591,"pending":0,"failed":3,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-09-02T14:20:13.761Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-09-02T14:20:13.761Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}

It looks like the direct fetch times out, which is not a surprise since its timeout is 30s. The crawler then tries to download the file "normally" through the browser, but a Puppeteer disconnection occurs after some time.
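
For context, the behaviour in the logs matches a pattern like the sketch below (TypeScript, purely illustrative and not the crawler's actual code): the direct fetch is capped by a fixed 30-second timeout, and when that is exceeded the URL is handed over to the browser, which then has to hold the very large response.

```ts
// Illustrative sketch only; the constant, helper names and fallback logic are
// assumptions based on the log messages above, not browsertrix-crawler code.

const FETCH_TIMEOUT_SECS = 30;

async function directFetchCapture(url: string): Promise<void> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), FETCH_TIMEOUT_SECS * 1000);
  try {
    const resp = await fetch(url, { signal: controller.signal });
    // Streaming a ~837 MB body takes far longer than 30s, so the abort fires
    // mid-download and the capture attempt is reported as timed out.
    await resp.arrayBuffer();
  } finally {
    clearTimeout(timer);
  }
}

async function capture(url: string, loadInBrowser: (url: string) => Promise<void>): Promise<void> {
  try {
    await directFetchCapture(url);
  } catch (err) {
    console.warn("Direct fetch capture attempt timed out", { seconds: FETCH_TIMEOUT_SECS, url });
    // Fallback: let the browser navigate to the PDF. Per the log "Large streamed
    // written to WARC, but not returned to browser, requires reading into memory",
    // the huge response must be held in memory, which is where it seems to crash.
    await loadInBrowser(url);
  }
}
```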

Note that we have other big files where the direct fetch timed out but the "normal crawl" managed to proceed.

I cannot reproduce the problem on my machine, so it is probably linked to a memory issue or something else specific to the machine where the bug occurs.

For the record, here is the command I used to try to reproduce the problem (where the big PDF unfortunately downloads properly):

docker run -v $PWD/output:/output --name crawlme --rm  webrecorder/browsertrix-crawler:1.3.0-beta.0 crawl --url "https://www.survivorlibrary.com/index.php/Farming_Corn" --cwd /output --depth 1 --scopeType host

Even if one could argue that we should simply run this crawl on another machine, I wonder whether it would make more sense to allow customizing FETCH_TIMEOUT_SECS with a CLI flag, so that the direct fetch does not fail just because of a timeout when we know the download has good reasons to take a long time to complete. Would that have any adverse side effect (aside from the risk that the crawl takes longer to detect a direct fetch that is genuinely stuck)?
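
To make the suggestion concrete, here is a minimal sketch of what such an option could look like, assuming a yargs-style argument parser (which I believe the crawler already uses); the flag name --fetchTimeoutSecs and its default are hypothetical, not an existing crawler option:

```ts
// Hypothetical sketch: "--fetchTimeoutSecs" is not an existing flag; it only
// illustrates replacing the hard-coded FETCH_TIMEOUT_SECS with a CLI option.
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

const argv = yargs(hideBin(process.argv))
  .option("fetchTimeoutSecs", {
    type: "number",
    default: 30,
    describe:
      "Timeout (in seconds) for the direct fetch capture of non-HTML resources",
  })
  .parseSync();

// The direct fetch would then use the configured value instead of the constant.
const fetchTimeoutMs: number = argv.fetchTimeoutSecs * 1000;
console.log(`direct fetch timeout: ${fetchTimeoutMs} ms`);
```

The crawl above could then be started with something like --fetchTimeoutSecs 300 for sites known to serve very large PDFs.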
