
Conversation

katrinleinweber

"frequency": "Unclear at this time.",
"description": "Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot"
},
"Lightpanda": {

Contributor

Yeah, it seems they may be using a user agent designed to look like that of a browser:

```
const USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
```

in which case there's not much we can do here.
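For illustration, here is a minimal sketch of the problem (the blocklist tokens are examples, not this project's actual list): substring matching on the User-Agent catches an honest token but not a spoofed Chrome string.

```python
# A minimal sketch, not this project's code: UA-substring blocking
# cannot catch a client that sends a mainstream Chrome UA string.
BLOCKLIST = ["Lightpanda", "GPTBot", "Kangaroo Bot"]  # example tokens only

SPOOFED_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

def is_blocked(user_agent: str) -> bool:
    return any(token.lower() in user_agent.lower() for token in BLOCKLIST)

print(is_blocked("Lightpanda/1.0"))  # True: an honest UA is easy to block
print(is_blocked(SPOOFED_UA))        # False: the spoofed UA sails through
```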

Contributor

Easy: block "headless"; it also blocks all the other scummy headless scanners :)

Please raise an issue to discuss this as a new policy.


> Easy: block "headless"; it also blocks all the other scummy headless scanners :)

Hey Tina -

Is that a feature of robots.txt? Or some other way of doing it?
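For context, robots.txt is only advisory, so a "headless" rule would have to live in the web server rather than in this project's list. A minimal WSGI sketch of the idea (mine, not an existing feature of this project), which only works while the client keeps a telltale UA such as "HeadlessChrome":

```python
# A minimal WSGI sketch, not a feature of this project: reject requests
# whose User-Agent contains "headless". Only effective while the client
# keeps a telltale UA such as "HeadlessChrome".
def block_headless(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if "headless" in ua.lower():
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

# Usage: wrap any WSGI app, e.g. application = block_headless(application)
```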

@krichprollsch

Hello 👋

Disclaimer: Lightpanda's co-creator here.

I just want to clarify that Lightpanda doesn't scrape by itself. We don't provide a scraping service, but a browser that could be used to crawl. It's a browser/HTTP client, just as curl or Chrome are.

The project is very young, and I don't think any of the issues with LLM/AI crawlers are due to Lightpanda.

So I would ask to hold off on this change, at least until we see reported issues with Lightpanda.

My idea is to keep the user agent hard-coded so that it is visible to websites and easily blocked if they want to.
But if the UA is blocked by default by this kind of generic list, the first thing users will do is hide it by changing the UA...

@glyn
Contributor

glyn commented Mar 28, 2025

> Hello 👋
>
> Disclaimer: Lightpanda's co-creator here.
>
> I just want to clarify that Lightpanda doesn't scrape by itself. We don't provide a scraping service, but a browser that could be used to crawl. It's a browser/HTTP client, just as curl or Chrome are.
>
> The project is very young, and I don't think any of the issues with LLM/AI crawlers are due to Lightpanda.
>
> So I would ask to hold off on this change, at least until we see reported issues with Lightpanda.
>
> My idea is to keep the user agent hard-coded so that it is visible to websites and easily blocked if they want to. But if the UA is blocked by default by this kind of generic list, the first thing users will do is hide it by changing the UA...

Makes sense. If we end up with a way (e.g. a script) to customise the robots.txt provided by this project, then individual users could choose to block additional UAs, including Lightpanda and derivatives.
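A rough sketch of what such a customisation script might look like (the file name and the extra UA tokens below are assumptions, not part of this project):

```python
# A rough sketch of the customisation idea above; the file name and the
# extra UA tokens are assumptions, not part of this project.
EXTRA_AGENTS = ["Lightpanda"]  # UAs an individual site chooses to add

def extend_robots_txt(path="robots.txt"):
    with open(path, "r", encoding="utf-8") as f:
        base = f.read().rstrip()
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in EXTRA_AGENTS]
    with open(path, "w", encoding="utf-8") as f:
        f.write(base + "\n\n" + "\n\n".join(blocks) + "\n")

if __name__ == "__main__":
    extend_robots_txt()
```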

@jamescgibson

> My idea is to keep the user agent hard-coded so that it is visible to websites and easily blocked if they want to. But if the UA is blocked by default by this kind of generic list, the first thing users will do is hide it by changing the UA...

Why not hard-code it to something that doesn't pretend to be a browser, then?

@krichprollsch

> Why not hard-code it to something that doesn't pretend to be a browser, then?

Not sure I understand what you mean exactly.

The User-Agent HTTP header identifies the HTTP client.
We hard-coded Lightpanda/1.0. We don't pretend to be another client/browser.

For example, curl sends curl/8.5.0.
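As a concrete illustration (example.com is a placeholder), the header is simply whatever the client chooses to send with each request:

```python
# Illustration only: the User-Agent is an ordinary request header that
# each client chooses for itself. example.com is a placeholder URL.
import urllib.request

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Lightpanda/1.0"},  # an honest, blockable token
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))
```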

@jamescgibson

> > Why not hard-code it to something that doesn't pretend to be a browser, then?
>
> Not sure I understand what you mean exactly.
>
> The User-Agent HTTP header identifies the HTTP client. We hard-coded Lightpanda/1.0. We don't pretend to be another client/browser.
>
> For example, curl sends curl/8.5.0.

Sorry, I'd seen glyn's comment above, which I take it was from an earlier version before it was updated to "Lightpanda".

@katrinleinweber
Author

@jamescgibson Only one of the two USER_AGENT constants is currently set to something unique. Thus, I consider your comment above to remain valid.

> But if the UA is blocked by default by this kind of generic list, the first thing users will do is hide it by changing the UA...

@krichprollsch True, but they might also reconsider what they're about to do.

@krichprollsch

@katrinleinweber One clarification: the hard-coded Chrome user agent [1] is used only in the CDP server implementation. We have to present ourselves as Chrome to CDP clients (Puppeteer, Playwright, ...) because many of them refuse to work with anything else.
This user agent is not used to request websites; we only use the hard-coded Lightpanda/1.0 [2].

But I understand it's confusing.
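For reference, a CDP client attaches roughly like this; a sketch assuming Playwright is installed and that Lightpanda listens on a local CDP endpoint (the address is an assumption, not a documented default here):

```python
# A sketch of attaching a CDP client, using Playwright's Python API.
# The ws:// endpoint is an assumption; the Chrome-like UA lives only in
# this CDP handshake, while requests to websites go out as
# "Lightpanda/1.0" per the comment above.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("ws://127.0.0.1:9222")
    page = browser.new_page()
    page.goto("https://example.com/")
    print(page.title())
    browser.close()
```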

> True, but they might also reconsider what they're about to do.

Sure, but I still think blocking identified usage by default doesn't give anyone a chance to make respectful requests.
It forces requesters to hide themselves by sending a Chrome UA, and website owners to adopt more in-depth bot detection based on JS execution.
