
Conversation

katrinleinweber

"frequency": "Unclear at this time.",
"description": "Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot"
},
"Lightpanda": {

Contributor

Yeah, it seems they may be using a user agent designed to look like that of a browser:

```
const USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
```

in which case there's not much we can do here.
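For illustration, here is a minimal sketch of the problem (the blocklist tokens are examples, not this project's actual list): substring matching on the User-Agent catches an honest token but not a spoofed Chrome string.

```python
# A minimal sketch, not this project's code: UA-substring blocking
# cannot catch a client that sends a mainstream Chrome UA string.
BLOCKLIST = ["Lightpanda", "GPTBot", "Kangaroo Bot"]  # example tokens only

SPOOFED_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

def is_blocked(user_agent: str) -> bool:
    return any(token.lower() in user_agent.lower() for token in BLOCKLIST)

print(is_blocked("Lightpanda/1.0"))  # True: an honest UA is easy to block
print(is_blocked(SPOOFED_UA))        # False: the spoofed UA sails through
```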

Contributor

Easy: block "headless"; it also blocks all the other scummy headless scanners :)

Please raise an issue to discuss this as a new policy.


> Easy: block "headless"; it also blocks all the other scummy headless scanners :)

Hey Tina -

Is that a feature of robots.txt? Or some other way of doing it?
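For context, robots.txt is only advisory, so a "headless" rule would have to live in the web server rather than in this project's list. A minimal WSGI sketch of the idea (mine, not an existing feature of this project), which only works while the client keeps a telltale UA such as "HeadlessChrome":

```python
# A minimal WSGI sketch, not a feature of this project: reject requests
# whose User-Agent contains "headless". Only effective while the client
# keeps a telltale UA such as "HeadlessChrome".
def block_headless(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if "headless" in ua.lower():
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

# Usage: wrap any WSGI app, e.g. application = block_headless(application)
```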

@krichprollsch

Hello 👋

Disclaimer: Lightpanda's co-creator here.

I just want to clarify that Lightpanda doesn't scrape by itself. We don't provide a scraping service, but a browser that could be used to crawl. It's a browser/HTTP client, just as curl or Chrome are.

The project is very young, and I don't think any of the issues with LLM/AI crawlers are due to Lightpanda.

So I would ask to hold off on this change, at least until we see reported issues with Lightpanda.

My idea is to keep the user agent hard-coded so that it is visible to websites and easily blocked if they want to.
But if the UA is blocked by default by this kind of generic list, the first thing users will do is hide it by changing the UA...

@glyn
Contributor

glyn commented Mar 28, 2025

> Hello 👋
>
> Disclaimer: Lightpanda's co-creator here.
>
> I just want to clarify that Lightpanda doesn't scrape by itself. We don't provide a scraping service, but a browser that could be used to crawl. It's a browser/HTTP client, just as curl or Chrome are.
>
> The project is very young, and I don't think any of the issues with LLM/AI crawlers are due to Lightpanda.
>
> So I would ask to hold off on this change, at least until we see reported issues with Lightpanda.
>
> My idea is to keep the user agent hard-coded so that it is visible to websites and easily blocked if they want to. But if the UA is blocked by default by this kind of generic list, the first thing users will do is hide it by changing the UA...

Makes sense. If we end up with a way (e.g. a script) to customise the robots.txt provided by this project, then individual users could choose to block additional UAs, including Lightpanda and derivatives.
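A rough sketch of what such a customisation script might look like (the file name and the extra UA tokens below are assumptions, not part of this project):

```python
# A rough sketch of the customisation idea above; the file name and the
# extra UA tokens are assumptions, not part of this project.
EXTRA_AGENTS = ["Lightpanda"]  # UAs an individual site chooses to add

def extend_robots_txt(path="robots.txt"):
    with open(path, "r", encoding="utf-8") as f:
        base = f.read().rstrip()
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in EXTRA_AGENTS]
    with open(path, "w", encoding="utf-8") as f:
        f.write(base + "\n\n" + "\n\n".join(blocks) + "\n")

if __name__ == "__main__":
    extend_robots_txt()
```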

@jamescgibson

> My idea is to keep the user agent hard-coded so that it is visible to websites and easily blocked if they want to. But if the UA is blocked by default by this kind of generic list, the first thing users will do is hide it by changing the UA...

Why not hard-code it to something that doesn't pretend to be a browser, then?

@krichprollsch

> Why not hard-code it to something that doesn't pretend to be a browser, then?

Not sure I understand what you mean exactly.

The User-Agent HTTP header identifies the HTTP client.
We hard-coded Lightpanda/1.0. We don't pretend to be another client/browser.

For example, curl sends curl/8.5.0.
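As a concrete illustration (example.com is a placeholder), the header is simply whatever the client chooses to send with each request:

```python
# Illustration only: the User-Agent is an ordinary request header that
# each client chooses for itself. example.com is a placeholder URL.
import urllib.request

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Lightpanda/1.0"},  # an honest, blockable token
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))
```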

@jamescgibson

> > Why not hard-code it to something that doesn't pretend to be a browser, then?
>
> Not sure I understand what you mean exactly.
>
> The User-Agent HTTP header identifies the HTTP client. We hard-coded Lightpanda/1.0. We don't pretend to be another client/browser.
>
> For example, curl sends curl/8.5.0.

Sorry, I'd seen glyn's comment above, which I take it was from an earlier version before it was updated to "Lightpanda".

@katrinleinweber
Author

@jamescgibson Only one of the two USER_AGENT constants is currently set to something unique. Thus, I consider your comment above to remain valid.

> But if the UA is blocked by default by this kind of generic list, the first thing users will do is hide it by changing the UA...

@krichprollsch True, but they might also reconsider what they're about to do.

@krichprollsch

@katrinleinweber One clarification: the hard-coded Chrome user agent [1] is used only in the CDP server implementation. We have to present ourselves as Chrome to CDP clients (Puppeteer, Playwright, ...) because many of them refuse to work with anything else.
This user agent is not used to request websites; we only use the hard-coded Lightpanda/1.0 [2].

But I understand it's confusing.
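For reference, a CDP client attaches roughly like this; a sketch assuming Playwright is installed and that Lightpanda listens on a local CDP endpoint (the address is an assumption, not a documented default here):

```python
# A sketch of attaching a CDP client, using Playwright's Python API.
# The ws:// endpoint is an assumption; the Chrome-like UA lives only in
# this CDP handshake, while requests to websites go out as
# "Lightpanda/1.0" per the comment above.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("ws://127.0.0.1:9222")
    page = browser.new_page()
    page.goto("https://example.com/")
    print(page.title())
    browser.close()
```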

> True, but they might also reconsider what they're about to do.

Sure, but I still think blocking identified usage by default doesn't give anyone a chance to make respectful requests.
It forces requesters to hide themselves by sending a Chrome UA, and website owners to adopt more in-depth bot detection based on JS execution.
