
Conversation

@ikreymer (Member) commented Sep 4, 2024

  • use '--timeout' value for direct fetch timeout, instead of fixed 30 seconds
  • don't consider 'document' as essential resource regardless of mime type, as any top-level URL is a document
  • don't count non-200 responses as non-essential even if missing content-type (fixes #676: Issue crawling a web property with big PDFs)

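The first change above can be sketched as follows. This is a hypothetical helper, not the actual crawler code; the name `withTimeout` and the call-site comment are assumptions. The point is that the direct-fetch path is bounded by the configured timeout value rather than a fixed 30-second constant:

```typescript
// Hypothetical sketch: bound a promise (such as a direct fetch) by a
// configurable timeout in seconds, instead of a fixed 30-second constant.
function withTimeout<T>(promise: Promise<T>, timeoutSecs: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(
        () => reject(new Error("direct fetch timed out")),
        timeoutSecs * 1000,
      ),
    ),
  ]);
}

// Assumed call site: pass the configured '--timeout' value through, e.g.
// await withTimeout(fetch(url), params.pageLoadTimeout);
```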
@benoit74 (Contributor) commented Sep 5, 2024

Thank you!

A few remarks:

  • I don't think that this.params.timeout is set; this.params.pageLoadTimeout seems to be the proper variable
  • don't you need to completely remove FETCH_TIMEOUT_SECS constant, and also update this.maxPageTime computation then?
  • I would consider also removing SITEMAP_INITIAL_FETCH_TIMEOUT_SECS and replacing it with --timeout as well; I don't see a particular reason to use a constant value here, for the same reasons

@ikreymer (Member, Author) commented Sep 5, 2024

  • I don't think that this.params.timeout is set; this.params.pageLoadTimeout seems to be the proper variable

They should both be available, but it's probably more consistent to use just one, since the other is an alias.

  • don't you need to completely remove FETCH_TIMEOUT_SECS constant, and also update this.maxPageTime computation then?

There's a follow-up PR #678 which adds more refactoring. I was thinking of repurposing that constant as a wait for the initial headers to load, in case the fetch() is stuck (maybe being blocked, detected as not a browser, etc.), so the full time is not used up if fetch() can't make any connection.
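The idea described here could look roughly like this. A sketch under assumed names, not the code in #678: abort the direct fetch() if the response headers have not arrived within a short initial window, while leaving the rest of the page time available for reading the body. Note that fetch() resolves once headers are received, so clearing the timer at that point leaves the body stream untouched:

```typescript
// Hypothetical sketch: give up early if response *headers* do not arrive
// within headerTimeoutSecs, so a stuck connection does not consume the
// full page timeout. Body streaming afterward is not affected, because
// the abort timer is cleared as soon as fetch() resolves (headers received).
async function fetchWithHeaderTimeout(
  url: string,
  headerTimeoutSecs: number,
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), headerTimeoutSecs * 1000);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```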

  • I would consider also removing SITEMAP_INITIAL_FETCH_TIMEOUT_SECS and replacing it with --timeout as well; I don't see a particular reason to use a constant value here, for the same reasons

Hm, this is a little bit different, since the sitemap loading happens only once at the beginning of the crawl, and this constant is the initial time to wait for the sitemap before continuing, e.g. the crawler will go on to load the first page while the sitemap is still being parsed in the background. Maybe this could even be lower; there's no need to wait at all, since as long as the first seed is there, the crawler can start...
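The behavior described above can be sketched as follows (a hypothetical helper; the names are assumptions, not the crawler's actual API): wait only a bounded initial period for sitemap results, and if they are not in yet, return null so the crawl can start from the first seed while sitemap parsing continues in the background:

```typescript
// Hypothetical sketch: bound only the *initial* wait for the sitemap.
// If parsing is still in progress after initialWaitSecs, resolve to null
// so the crawl can start; the sitemap promise keeps running in the
// background and can enqueue URLs as they arrive.
async function initialSitemapWait(
  sitemapPromise: Promise<string[]>,
  initialWaitSecs: number,
): Promise<string[] | null> {
  return Promise.race([
    sitemapPromise,
    new Promise<null>((resolve) =>
      setTimeout(() => resolve(null), initialWaitSecs * 1000),
    ),
  ]);
}
```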

@ikreymer (Member, Author) commented Sep 5, 2024

@benoit74 Can you confirm that this fixes the issue you're having?

@benoit74 (Contributor) commented Sep 5, 2024

Yes, it works as intended, thank you!

@ikreymer ikreymer merged commit 0d6a0b0 into main Sep 5, 2024
4 checks passed
@ikreymer ikreymer deleted the simpler-fix-direct-fetch branch September 5, 2024 17:32