This repository contains a list of Well Known Bots, including robots, crawlers,
validators, monitors, and spiders, in a single JSON file. Each bot has an identifier
and a RegExp pattern to match against an HTTP User-Agent header.
Additional metadata is available on each item.
Download the well-known-bots.json file directly.
It's impossible to create a system that can detect all bots. Well-behaved bots identify themselves consistently, usually through a User-Agent string that the patterns in this project match. These bots are straightforward to identify, but misbehaving bots pretend to be real clients and use various mechanisms to evade detection.
For more details, see Non-Technical Notes in the browser-fingerprinting project.
To block a particular bot that is not on this list, you can use an Arcjet filter. See the Malicious traffic blueprint for how to block custom bots.
Each entry in the JSON represents a specific bot or crawler and includes the following fields:
- id: A unique identifier for the bot
- categories: An array of categories the bot belongs to (e.g., "search-engine", "advertising")
- pattern: A regular expression pattern used to identify the bot in user agent strings
- url: (optional) A URL with more information about the bot
- verification: A list of supported methods for verifying the bot's identity (an empty list if the bot cannot be verified)
- instances: An array of example user agent strings for the bot
- aliases: Extra unique identifiers for the bot that can be used to identify it across other data sources
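As a rough illustration of how these fields fit together, the sketch below matches a User-Agent header against a list of entries. The entry shown is hypothetical (its id, pattern, and URL are made up for the example and are not taken from well-known-bots.json), and find_bots is an illustrative helper name. It also assumes the patterns are compatible with Python's re module, which may not hold for every pattern in the file.

```python
import re

# Hypothetical entry in the documented shape; not copied from well-known-bots.json.
# With the real file, the list could instead be read with json.load().
SAMPLE_BOTS = [
    {
        "id": "example-crawler",
        "categories": ["search-engine"],
        "pattern": r"ExampleBot\/",
        "url": "https://example.com/bot",
        "verification": [],
        "instances": [
            "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"
        ],
        "aliases": [],
    }
]


def find_bots(user_agent, bots):
    """Return the ids of all entries whose pattern matches the User-Agent."""
    return [bot["id"] for bot in bots if re.search(bot["pattern"], user_agent)]
```

The instances field doubles as test data: every example string listed for an entry should match that entry's pattern.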
Each verification entry contains the following fields:
- type: The method of verification (dns and cidr are supported)
If you specify dns verification then these fields are expected:
- masks: An array of mask patterns used for verification
If you specify cidr verification then these fields are expected:
- sources: An array of sources to pull cidr range data from (at least one is required)
The mask patterns use the following special characters:
- *: Represents 0 or 1 of any character
- @: Acts as a wildcard, matching any number of characters
All other characters in the mask require an exact match.
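The mask rules above can be sketched by translating a mask into a regular expression. This is a minimal sketch, not the project's own implementation: mask_to_regex and hostname_matches are hypothetical helper names, and it assumes the mask is compared against the full hostname obtained from a reverse DNS lookup of the client IP. The masks and hostnames in the usage note are made up.

```python
import re


def mask_to_regex(mask):
    """Translate a mask into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".?")  # 0 or 1 of any character
        elif ch == "@":
            parts.append(".*")  # any number of characters
        else:
            parts.append(re.escape(ch))  # everything else is an exact match
    return re.compile("".join(parts))


def hostname_matches(mask, hostname):
    """Return True if the whole hostname matches the mask."""
    return mask_to_regex(mask).fullmatch(hostname) is not None
```

For example, under these assumptions hostname_matches("crawl-***-***.example.com", "crawl-66-249.example.com") is true, while hostname_matches("@.example.com", "a.example.org") is false.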
Each cidr source requires the following fields:
- type: The type of source (currently only http-json is supported)
- url: The URL that hosts the IP ranges
- selector: A JSONPath selector that selects all of the IP ranges in the source
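To make the cidr fields concrete, here is a hedged sketch of what a verification entry might look like and how a client IP could be checked once the ranges are known. The URL, selector, and IP ranges are hypothetical, and ip_in_ranges is an illustrative helper name. The sketch does not fetch the source or evaluate the selector; it assumes the ranges have already been extracted.

```python
import ipaddress

# Hypothetical verification entry in the documented shape.
SAMPLE_VERIFICATION = {
    "type": "cidr",
    "sources": [
        {
            "type": "http-json",
            "url": "https://example.com/bot-ip-ranges.json",
            "selector": "$.prefixes[*].ipv4Prefix",
        }
    ],
}

# Ranges assumed to have been selected from the source already.
# These are documentation-only address blocks, not real bot ranges.
SAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip, ranges):
    """Return True if the IP address falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(ip)
    for cidr in ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if address.version == network.version and address in network:
            return True
    return False
```

A request whose User-Agent matches the bot's pattern but whose IP falls outside every published range would then be treated as unverified.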
The project is a hard fork of crawler-user-agents at commit
46831767324e10c69c9ac6e538c9847853a0feb9, which is distributed under the MIT
License.