Bug? PlaywrightCrawler enqueueLinks fails after WWW redirect. #2513

@obsidience

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

Hi all,

Is the following a bug? context.enqueueLinks seems to fail when the requested URL is redirected to its WWW variant. When this happens, the selector does extract the URLs, but the "createFilteredRequests" call in enqueue_links.js uses an enqueueStrategyPattern of "{glob: 'http{s,}://domain.com/**'}". Because the glob has no WWW prefix, the extracted links do not match it and the call fails.
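To make the mismatch concrete, here is a minimal sketch of the filtering step as I understand it. It is not Crawlee's code: matchesHostnamePattern is a hypothetical stand-in for the 'http{s,}://domain.com/**' glob, and the extracted link is a made-up example.

```javascript
// Simplified stand-in for a 'http{s,}://<hostname>/**' glob: accept http or
// https links whose hostname equals the hostname the pattern was built from.
// NOTE: illustrative only; this is not Crawlee's implementation.
function matchesHostnamePattern(patternHostname, link) {
    const url = new URL(link);
    const protocolOk = url.protocol === 'http:' || url.protocol === 'https:';
    return protocolOk && url.hostname === patternHostname;
}

// Hostname taken from the URL passed to crawler.run(), before the redirect.
const requestedHostname = new URL('https://reddit.com/r/legal').hostname;

// Hypothetical link extracted from the page after the redirect to www.
const extractedLink = 'https://www.reddit.com/r/legal/comments/abc123/example/';

// Pattern built from the requested URL versus the URL that actually loaded.
const keptWithRequestedHost = matchesHostnamePattern(requestedHostname, extractedLink);
const keptWithLoadedHost = matchesHostnamePattern('www.reddit.com', extractedLink);

console.log({ keptWithRequestedHost, keptWithLoadedHost });
```

If the pattern is built from the pre-redirect hostname, the link is dropped; built from the hostname that actually loaded, the same link passes.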

This looks like it may be caused by resolveBaseUrlForEnqueueLinksFiltering() in enqueue_links.js, which assumes that "same origin" is the sanest option. Wouldn't "same domain" be a saner default for a typical crawler, given that http->https and www redirects are common?
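A rough sketch of how the candidate scopes differ for the two redirect types mentioned above. The helper names are hypothetical, and naiveDomain is a deliberate simplification (last two hostname labels) that is an assumption of this sketch, not how a real crawler should compute the registrable domain.

```javascript
// Naive registrable-domain guess: keep the last two hostname labels.
// ASSUMPTION: wrong for suffixes such as 'co.uk'; a real implementation
// would consult the Public Suffix List.
function naiveDomain(hostname) {
    return hostname.split('.').slice(-2).join('.');
}

// Hypothetical helper: compare two URLs under three possible scopes.
function compareScopes(a, b) {
    const ua = new URL(a);
    const ub = new URL(b);
    return {
        sameOrigin: ua.origin === ub.origin, // scheme + host + port
        sameHostname: ua.hostname === ub.hostname, // host only
        sameDomain: naiveDomain(ua.hostname) === naiveDomain(ub.hostname),
    };
}

const wwwRedirect = compareScopes('https://reddit.com/r/legal', 'https://www.reddit.com/r/legal');
const httpsUpgrade = compareScopes('http://reddit.com/r/legal', 'https://reddit.com/r/legal');

console.log({ wwwRedirect, httpsUpgrade });
```

Only the domain-level comparison treats both redirects as staying on the same site. enqueueLinks does accept a strategy option, so passing strategy: 'same-domain' might work around this, though I have not verified that against this repro.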

Thanks for your help!

Code sample

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
	async requestHandler(context) {
		await context.enqueueLinks({
			selector: 'a[slot="full-post-link"]', // fails
			//globs: ['**/comments/**'], // succeeds
		});
	},
	headless: false,
	launchContext: {
		launchOptions: {
			slowMo: 500,
		},
	},
});

await crawler.run(['https://reddit.com/r/legal']); // note: this is missing "www."

Package version

3.10.2

Node.js version

20.13.1

Operating system

Win11

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

Metadata

Labels

bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.
