Skip to content

PlaywrightCrawler error_handler cannot access Page object #1482

@janbuchar

Description

@janbuchar

In the Javascript version, the error handler is able to access the Page object via PlaywrightCrawlingContext.page. I discovered that the Python version doesn't implement this when porting the ContextPipeline to Javascript.

Test case

async def test_error_handler_can_access_page(server_url: URL) -> None:
    crawler = PlaywrightCrawler(max_request_retries=2)

    request_handler = mock.AsyncMock(side_effect=RuntimeError('Intentional crash'))
    crawler.router.default_handler(request_handler)

    error_handler_calls: list[str | None] = []

    @crawler.error_handler
    async def error_handler(context: BasicCrawlingContext | PlaywrightCrawlingContext, _error: Exception) -> None:
        error_handler_calls.append(
            await context.page.content() if isinstance(context, PlaywrightCrawlingContext) else None
        )

    await crawler.run([str(server_url / 'hello-world')])

    assert error_handler_calls == [HELLO_WORLD, HELLO_WORLD, HELLO_WORLD]

Possible solutions

  1. Run the error handlers before the cleanup step of the context pipeline
  2. Add some "deferred cleanup" step to the context pipeline and call that after error handlers are done
    • it's unclear how this would fit in the current async generator based middleware model
    • considerable refactoring of the _run_request_handler and __run_task_function would still be necessary - error handlers are called by the latter and context pipeline is only handled in the former

Metadata

Metadata

Assignees

No one assigned

    Labels

    t-toolingIssues with this label are in the ownership of the tooling team.

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions