This project is a command-line application that allows you to crawl web pages starting from a given URL, retrieve documents, and search for specific text queries within those documents.
- Crawl and fetch documents from a specified starting URL.
- Search for a query text within the crawled documents.
- Configure the maximum number of documents to retrieve.
- Set a timeout for the crawling and query process.
- Only English is supported for tokenization.
Option | Description | Default Value |
---|---|---|
`-s, --start-point <URL>` | A valid URL to start fetching and crawling from. | N/A |
`-q, --query <QUERY TEXT>` | Text query to search within the documents and retrieve the related ones. | N/A |
`-m, --max-doc <NUMBER>` | Maximum number of documents to retrieve. | 10 |
`-t, --timeout <SECONDS>` | Timeout (in seconds) for processing the request. | 10 |
To run the program, use the following command:
./crawler -s https://example.com -q "sample query" -m 15 -t 20
This will:
- Start crawling from https://example.com.
- Search for "sample query" within the crawled documents.
- Retrieve up to 15 documents.
- Time out after 20 seconds if the process takes too long.
Download the binary directly from the Releases page, or compile it yourself:
- Clone the repository (make sure Rust is installed on your machine):
  `git clone <repository-url>`
- Build the project:
  `cargo build --release`
This project is licensed under the MIT License. See LICENSE for more details.
Contributions are welcome! Feel free to submit a pull request or open an issue.
Built with performance in mind:
- Tokio thread pools, with channels between them, for I/O-bound tasks.
- Rayon for CPU-bound tasks.
- Database: SurrealDB.
All of this supports scalability, maintainability, and isolation of the main process; a rough sketch of the pipeline is shown below.
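The sketch assumes only the crates named above (tokio, rayon); the channel layout and the exact split between async fetching and CPU-bound tokenization are illustrative, not taken from this project's implementation.

```rust
// Illustrative pipeline sketch, assuming only the crates named above
// (tokio, rayon). Names and the division of work are not taken from
// this project's source.
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Channel from the I/O-bound fetch side to the CPU-bound side.
    let (tx, mut rx) = mpsc::channel::<String>(32);

    // I/O-bound: fetch pages on the Tokio runtime and push raw bodies
    // into the channel. A real crawler would use an HTTP client here;
    // this placeholder just fabricates a "document" per URL.
    tokio::spawn(async move {
        for url in ["https://example.com"] {
            let body = format!("document fetched from {url}");
            if tx.send(body).await.is_err() {
                break; // receiver dropped, stop fetching
            }
        }
    });

    // CPU-bound: hand each document to Rayon for tokenization/scoring,
    // via spawn_blocking so the async runtime stays free for I/O.
    while let Some(doc) = rx.recv().await {
        let tokens = tokio::task::spawn_blocking(move || {
            use rayon::prelude::*;
            doc.split_whitespace()
                .collect::<Vec<_>>()
                .par_iter()
                .map(|t| t.to_lowercase())
                .collect::<Vec<String>>()
        })
        .await
        .expect("tokenization task panicked");
        println!("tokenized {} terms from one document", tokens.len());
    }
}
```

Keeping the CPU-heavy work inside `spawn_blocking`/Rayon prevents it from stalling the async runtime that drives the network I/O, which is the isolation the list above refers to.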
Happy crawling!