
Conversation

@ikreymer (Member) commented Sep 4, 2024

  • use '--timeout' value for direct fetch timeout, instead of fixed 30 seconds
  • don't consider 'document' as essential resource regardless of mime type, as any top-level URL is a document
  • don't count non-200 responses as non-essential even if missing content-type (fixes #676: Issue crawling a web property with big PDFs)

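The first change above can be sketched as follows. This is a hypothetical helper, not the actual crawler code; the name `withTimeout` and the call-site comment are assumptions. The point is that the direct-fetch path is bounded by the configured timeout value rather than a fixed 30-second constant:

```typescript
// Hypothetical sketch: bound a promise (such as a direct fetch) by a
// configurable timeout in seconds, instead of a fixed 30-second constant.
function withTimeout<T>(promise: Promise<T>, timeoutSecs: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(
        () => reject(new Error("direct fetch timed out")),
        timeoutSecs * 1000,
      ),
    ),
  ]);
}

// Assumed call site: pass the configured '--timeout' value through, e.g.
// await withTimeout(fetch(url), params.pageLoadTimeout);
```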
@benoit74 (Contributor) commented Sep 5, 2024

Thank you!

A few remarks:

  • I don't think that this.params.timeout is set; this.params.pageLoadTimeout seems to be the proper variable
  • don't you need to completely remove FETCH_TIMEOUT_SECS constant, and also update this.maxPageTime computation then?
  • I would consider also removing SITEMAP_INITIAL_FETCH_TIMEOUT_SECS and replacing it with --timeout as well; I don't see a particular reason to use a constant value here, for the same reasons

@ikreymer (Member, Author) commented Sep 5, 2024

  • I don't think that this.params.timeout is set; this.params.pageLoadTimeout seems to be the proper variable

They should both be available, but it's probably more consistent to use just one, since the other is an alias.

  • don't you need to completely remove FETCH_TIMEOUT_SECS constant, and also update this.maxPageTime computation then?

There's a follow-up PR #678 which adds more refactoring. I was thinking of repurposing that constant as a wait for the initial headers to load, in case the fetch() is stuck (maybe being blocked, detected as not a browser, etc.), so the full time is not used up if fetch() can't make any connection.
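The idea described here could look roughly like this. A sketch under assumed names, not the code in #678: abort the direct fetch() if the response headers have not arrived within a short initial window, while leaving the rest of the page time available for reading the body. Note that fetch() resolves once headers are received, so clearing the timer at that point leaves the body stream untouched:

```typescript
// Hypothetical sketch: give up early if response *headers* do not arrive
// within headerTimeoutSecs, so a stuck connection does not consume the
// full page timeout. Body streaming afterward is not affected, because
// the abort timer is cleared as soon as fetch() resolves (headers received).
async function fetchWithHeaderTimeout(
  url: string,
  headerTimeoutSecs: number,
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), headerTimeoutSecs * 1000);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```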

  • I would consider also removing SITEMAP_INITIAL_FETCH_TIMEOUT_SECS and replacing it with --timeout as well; I don't see a particular reason to use a constant value here, for the same reasons

Hm, this is a little bit different, since the sitemap loading happens only once at the beginning of the crawl, and this constant is the initial time to wait for the sitemap before continuing, e.g. the crawler will go on to load the first page while the sitemap is still being parsed in the background. Maybe this could even be lower; there's no need to wait at all, since as long as the first seed is there, the crawler can start...
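The behavior described above can be sketched as follows (a hypothetical helper; the names are assumptions, not the crawler's actual API): wait only a bounded initial period for sitemap results, and if they are not in yet, return null so the crawl can start from the first seed while sitemap parsing continues in the background:

```typescript
// Hypothetical sketch: bound only the *initial* wait for the sitemap.
// If parsing is still in progress after initialWaitSecs, resolve to null
// so the crawl can start; the sitemap promise keeps running in the
// background and can enqueue URLs as they arrive.
async function initialSitemapWait(
  sitemapPromise: Promise<string[]>,
  initialWaitSecs: number,
): Promise<string[] | null> {
  return Promise.race([
    sitemapPromise,
    new Promise<null>((resolve) =>
      setTimeout(() => resolve(null), initialWaitSecs * 1000),
    ),
  ]);
}
```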

@ikreymer (Member, Author) commented Sep 5, 2024

@benoit74 Can you confirm that this fixes the issue you're having?

@benoit74 (Contributor) commented Sep 5, 2024

Yes, it works as intended, thank you!

@ikreymer ikreymer merged commit 0d6a0b0 into main Sep 5, 2024
4 checks passed
@ikreymer ikreymer deleted the simpler-fix-direct-fetch branch September 5, 2024 17:32