List of well-known bots and user-agent patterns to detect them

arcjet/well-known-bots

Well Known Bots

This repository contains a list of Well Known Bots, including robots, crawlers, validators, monitors, and spiders, in a single JSON file. Each bot has a unique identifier and a RegExp pattern to match against an HTTP User-Agent header. Additional metadata is available on each item.

Install

Direct download

Download the well-known-bots.json file directly.

Realities

It's impossible to create a system that can detect all bots. Well-behaving bots identify themselves in a consistent manner, usually via the User-Agent patterns this project provides. It is straightforward to identify these well-behaving bots, but misbehaving bots pretend to be real clients and use various mechanisms to evade detection.

For more details, see Non-Technical Notes in the browser-fingerprinting project.

Custom bots

To block a particular bot that is not on this list, you can use an Arcjet filter. See the Malicious traffic blueprint for how to block custom bots.

Structure

Each entry in the JSON represents a specific bot or crawler and includes the following fields:

  • id: A unique identifier for the bot
  • categories: An array of categories the bot belongs to (e.g., "search-engine", "advertising")
  • pattern: A regular expression pattern used to identify the bot in user agent strings
  • url: (optional) A URL with more information about the bot
  • verification: A list of supported methods for verifying the bot's identity (if the bot is not verifiable, this list should be empty)
  • instances: An array of example user agent strings for the bot
  • aliases: Extra unique identifiers for the bot that can be used to identify it across other data sources
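
As an illustration of how these fields fit together, the following is a minimal Python sketch that matches a User-Agent string against each entry's pattern. The sample entry is hypothetical (it is not copied from well-known-bots.json), and the sketch assumes the patterns are compatible with Python's `re` module, which may not hold for every pattern in the real file.

```python
import re

# Hypothetical entry shaped like the fields listed above; not taken from the real file.
SAMPLE_ENTRIES = [
    {
        "id": "example-crawler",
        "categories": ["search-engine"],
        "pattern": r"ExampleBot\/",
        "url": "https://example.com/bot",
        "verification": [],
        "instances": [
            "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"
        ],
        "aliases": [],
    }
]


def find_bots(user_agent, entries):
    """Return the ids of all entries whose pattern matches the User-Agent."""
    return [e["id"] for e in entries if re.search(e["pattern"], user_agent)]


def failing_instances(entries):
    """Return (id, instance) pairs where an example instance does not match
    its own entry's pattern; useful as a sanity check on the data."""
    return [
        (e["id"], ua)
        for e in entries
        for ua in e.get("instances", [])
        if not re.search(e["pattern"], ua)
    ]
```

With the real file, `entries` would come from something like `json.load(open("well-known-bots.json"))`, assuming the file is a top-level array of entries.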

Verification

Each verification entry contains the following fields:

  • type: The method of verification (dns and cidr are supported)

If you specify dns verification then these fields are expected:

  • masks: An array of mask patterns used for verification

If you specify cidr verification then these fields are expected:

  • sources: An array of sources to pull CIDR range data from (at least one is required)

Verification mask patterns

The mask patterns use the following special characters:

  • *: Represents 0 or 1 of any character
  • @: Acts as a wildcard, matching any number of characters

All other characters in the mask require an exact match.
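
One way to read these rules is as a translation into a regular expression. The Python sketch below does that under stated assumptions: the whole value must match the mask, matching is case-sensitive, and the value being tested is a hostname (for example, one obtained from a reverse DNS lookup of the client IP). None of these details is specified above, and the example mask and hostnames are hypothetical.

```python
import re


def mask_to_regex(mask):
    """Translate a verification mask into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".?")  # 0 or 1 of any character
        elif ch == "@":
            parts.append(".*")  # any number of characters
        else:
            parts.append(re.escape(ch))  # every other character matches exactly
    return re.compile("".join(parts))


def mask_matches(mask, value):
    """Return True if the entire value matches the mask (assumed semantics)."""
    return mask_to_regex(mask).fullmatch(value) is not None
```

For example, under these assumptions the hypothetical mask `@.crawl.example.com` accepts `host-1.crawl.example.com` but rejects `crawl.example.com.evil.net`, because the mask must cover the whole hostname.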

CIDR verification sources

Each CIDR source requires the following fields:

  • type: The type of source (currently only http-json is supported)
  • url: The URL that hosts the IP ranges
  • selector: A JSONPath selector that selects all of the IP ranges in the source
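
Once the IP ranges have been selected from a source, checking a client address against them is straightforward. The Python sketch below shows only that final step, using the standard `ipaddress` module; fetching the URL and evaluating the JSONPath selector are left out, and the ranges shown are reserved documentation ranges used as hypothetical stand-ins for real source data.

```python
import ipaddress

# Hypothetical ranges, standing in for what a source's selector would yield.
SAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip, cidr_ranges):
    """Return True if the IP address falls inside any of the CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidr_ranges:
        # strict=False tolerates ranges written with host bits set.
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only like with like: IPv4 against IPv4, IPv6 against IPv6.
        if addr.version == network.version and addr in network:
            return True
    return False
```

A request whose User-Agent matches a bot's pattern but whose source address lies outside that bot's ranges could then be treated as unverified; how to act on that is a policy decision not covered here.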

License

This project is a hard fork of crawler-user-agents at commit 46831767324e10c69c9ac6e538c9847853a0feb9, which is distributed under the MIT License.
