-
Notifications
You must be signed in to change notification settings - Fork 492
Open
Labels
t-toolingIssues with this label are in the ownership of the tooling team.Issues with this label are in the ownership of the tooling team.
Description
In the Javascript version, the error handler is able to access the Page
object via PlaywrightCrawlingContext.page
. I discovered that the Python version doesn't implement this when porting the ContextPipeline
to Javascript.
Test case
async def test_error_handler_can_access_page(server_url: URL) -> None:
crawler = PlaywrightCrawler(max_request_retries=2)
request_handler = mock.AsyncMock(side_effect=RuntimeError('Intentional crash'))
crawler.router.default_handler(request_handler)
error_handler_calls: list[str | None] = []
@crawler.error_handler
async def error_handler(context: BasicCrawlingContext | PlaywrightCrawlingContext, _error: Exception) -> None:
error_handler_calls.append(
await context.page.content() if isinstance(context, PlaywrightCrawlingContext) else None
)
await crawler.run([str(server_url / 'hello-world')])
assert error_handler_calls == [HELLO_WORLD, HELLO_WORLD, HELLO_WORLD]
Possible solutions
- Run the error handlers before the cleanup step of the context pipeline
- this is a fairly big change and we probably want to do it after fix: Only apply requestHandlerTimeout to request handler #1474
- changing this in the adaptive playwright crawler will be especially tricky
- Add some "deferred cleanup" step to the context pipeline and call that after error handlers are done
- it's unclear how this would fit in the current async generator based middleware model
- considerable refactoring of the
_run_request_handler
and__run_task_function
would still be necessary - error handlers are called by the latter and context pipeline is only handled in the former
Metadata
Metadata
Assignees
Labels
t-toolingIssues with this label are in the ownership of the tooling team.Issues with this label are in the ownership of the tooling team.