Add Lightpanda due to its AI/LLM focus #95
"frequency": "Unclear at this time.",
"description": "Kangaroo Bot is used by the company Kangaroo LLM to download data to train AI models tailored to Australian language and culture. More info can be found at https://darkvisitors.com/agents/agents/kangaroo-bot"
},
"Lightpanda": {
Got this from https://github.com/lightpanda-io/browser/blob/82e67b7550629da49e83bfb8c0100dce538c0009/src/browser/browser.zig#L53
However, the user agent might be different and could change, given the TODO comment there.
Yeah, it seems they may be using a user agent designed to look like that of a browser:
const USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
in which case there's not much we can do here.
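To illustrate why a browser-like user agent leaves little to act on, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes a rule keyed on the `Lightpanda` token from this PR; the `Lightpanda/1.0` string and the `example.com` URL are hypothetical, chosen only for illustration. Note that robots.txt is advisory in any case, so this only shows what a well-behaved client would conclude.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt group that disallows the "Lightpanda" token everywhere.
rules = RobotFileParser()
rules.parse(["User-agent: Lightpanda", "Disallow: /"])

URL = "https://example.com/page"  # hypothetical URL

# Hypothetical self-identifying user agent string.
NAMED_UA = "Lightpanda/1.0"
# The Chrome-like user agent quoted in the comment above.
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

named_allowed = rules.can_fetch(NAMED_UA, URL)
chrome_allowed = rules.can_fetch(CHROME_UA, URL)

print("named UA allowed:", named_allowed)
print("Chrome-like UA allowed:", chrome_allowed)
```

With this parser, the rule applies to the self-identifying string but not to the Chrome-like one, since nothing in the latter matches the `Lightpanda` token.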
EASY: block: headless, it also blocks all other scum headless scanners :)
Please raise an issue to discuss this as a new policy.
EASY: block: headless, it also blocks all other scum headless scanners :)
Hey Tina -
Is that a feature of robots.txt? Or is there some other way of doing it?
Hello 👋 Disclaimer: Lightpanda's co-creator here. I just want to clarify that Lightpanda doesn't scrape by itself. We don't offer a scraping service, but a browser that could be used to crawl. It's a browser/HTTP client, just as curl or Chrome can be. The project is very young, and I don't think any of the issues with LLM/AI crawlers are due to Lightpanda. So I would ask to hold off on this change, at least until we see reported issues with Lightpanda. My idea is to keep the user agent hard-coded so that it is visible to websites and easy to block if they want to.
Makes sense. If we end up with a way (e.g. a script) to customise the robots.txt provided by this project, then individual users could choose to block additional UAs, including Lightpanda and derivatives.
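As a rough sketch of what such a customisation script might look like (this is not something the project provides; the function name `extend_robots_txt` and the sample base file are hypothetical), a user could append their own group of blocked user agents to the generated robots.txt:

```python
def extend_robots_txt(base: str, extra_agents: list[str]) -> str:
    """Return `base` plus one extra group that disallows every agent in `extra_agents`.

    Hypothetical helper for illustration; not part of this project.
    """
    if not extra_agents:
        return base
    # One "User-agent" line per extra agent, sharing a single "Disallow: /" rule.
    group = [f"User-agent: {agent}" for agent in extra_agents]
    group.append("Disallow: /")
    # Separate the new group from the existing content with a blank line.
    return base.rstrip("\n") + "\n\n" + "\n".join(group) + "\n"


# Sample base content standing in for the robots.txt shipped by the project.
base = "User-agent: GPTBot\nDisallow: /\n"
customised = extend_robots_txt(base, ["Lightpanda"])
print(customised)
```

Keeping the extra agents in a separate group means the shipped file stays untouched and each user's additions are easy to spot and remove.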
Why not hard code it to something that doesn't pretend to be a browser, then?
Not sure I understand what you mean exactly. The For example curl sends |
Sorry, I'd seen glyn's comment above which I take it was from an earlier version before it was updated to "Lightpanda".
@jamescgibson Only one of two
@krichprollsch True, but they might also reconsider what they're about to do.
@katrinleinweber One clarification: the hard-coded Chrome user agent [1] is used only in the CDP server implementation. We have to present ourselves as Chrome to CDP clients (Puppeteer, Playwright, ...) because many of them refuse to work with anything else. But I understand it's confusing.
Sure, but I still think blocking identified usages by default doesn't give a chance to make respectful requests.
See https://github.com/lightpanda-io/browser for details.