
How the crawler works


At a very high level, the crawler is a containerized application orchestrated with Docker Compose. It spins up a REST API, a database, and several other containers that work together to visit websites and extract structured data. In what follows, we will explain the architecture of the crawler in a Q&A format. To learn more about Docker, see the official documentation here: https://docs.docker.com/get-started/docker-overview/

Q1. What are the main components of the crawler?

The main components of the crawler correspond to individual Docker services defined in the docker-compose.yml file at the root of the repository. Each one is responsible for a distinct part of the crawling pipeline. When you start the compose project, the containers are started in the following order (see the sketch after this list for how such an ordering can be expressed in Compose):

  1. mariadb and crawl_browser are started first.

    • mariadb is a MySQL-compatible database that all other services write to or read from.
    • crawl_browser is a Selenium Firefox container that acts as the controlled browser instance for interacting with websites.

  2. phpmyadmin and rest_api come up next.
    Once the database is healthy, the REST API (rest_api) and the admin interface (phpmyadmin) are started. These services allow other parts of the system (and users) to read from and write to the database.

  3. crawl_driver is started once the above services are healthy.
    The crawl driver uses the managed browser to visit websites, collect data, and interact with the REST API and database.

  4. well_known_crawl runs after the main crawl is complete, that is, when the crawl_driver container shuts down.
    This separate container probes .well-known/gpc.json files across domains.
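
To make the startup order concrete, below is a minimal, hypothetical compose sketch of how such an ordering can be expressed with depends_on conditions and a healthcheck. The service names match the ones above, but the images, the healthcheck command, and all other details are illustrative assumptions, not the project's actual configuration:

services:
  mariadb:
    image: mariadb:latest                              # assumption: the actual image/tag may differ
    healthcheck:
      test: ["CMD", "healthcheck.sh", "--connect"]     # hypothetical health check command
      interval: 10s
      retries: 5
  crawl_browser:
    image: selenium/standalone-firefox                 # Selenium Firefox container
  phpmyadmin:
    depends_on:
      mariadb:
        condition: service_healthy                     # wait until the database reports healthy
  rest_api:
    depends_on:
      mariadb:
        condition: service_healthy
  crawl_driver:
    depends_on:
      rest_api:
        condition: service_started
      crawl_browser:
        condition: service_started
  well_known_crawl:
    depends_on:
      crawl_driver:
        condition: service_completed_successfully      # run only after the main crawl container exits

With condition: service_completed_successfully, Compose starts well_known_crawl only once crawl_driver has exited successfully, which matches the behavior described in step 4.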

Q2. How do the containers "talk" to each other?

In the docker-compose.yml file, we define a shared network called mariadb_network:

networks:
  mariadb_network:
    driver: bridge

Each container that needs to communicate with others includes the following lines in its configuration:

networks:
  - mariadb_network

This tells Docker to connect the container to the shared network. Once connected, containers can communicate with each other using their service names as hostnames. For example, the crawl_driver container can send requests to rest_api:8080, and the rest_api container can connect to the database using the hostname mariadb.
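
As an illustration, a service attached to this network can refer to its peers purely by service name. The following hypothetical excerpt shows how crawl_driver might be pointed at the other services; the environment variable names (DB_HOST, API_URL) are assumptions for illustration, not the project's actual configuration:

services:
  crawl_driver:
    networks:
      - mariadb_network
    environment:
      DB_HOST: mariadb                # resolved by Docker's internal DNS to the mariadb container
      API_URL: http://rest_api:8080   # the REST API is reachable by its service name and port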

Because all services are on the same internal network, Docker handles service discovery automatically — no manual IP configuration is needed. This setup ensures all containers can talk to each other reliably and securely within the isolated Docker environment.

To learn more about Docker networking, see the official documentation here: https://docs.docker.com/network/
