This project is a command-line application that allows you to crawl web pages starting from a given URL, retrieve documents, and search for specific text queries within those documents.
- Crawl and fetch documents from a specified starting URL.
- Search for a query text within the crawled documents.
- Configure the maximum number of documents to retrieve.
- Set a timeout for the crawling and query process.
- Only English is supported for tokenization.
Option | Description | Default Value |
---|---|---|
`-s, --start-point <URL>` | A valid URL to start fetching and crawling from. | N/A |
`-q, --query <QUERY TEXT>` | Text query to search within the documents and retrieve the related ones. | N/A |
`-m, --max-doc <NUMBER>` | Maximum number of documents to retrieve. | 10 |
`-t, --timeout <SECONDS>` | Timeout (in seconds) for processing the request. | 10 |
To run the program, use the following command:
./crawler -s https://example.com -q "sample query" -m 15 -t 20
This will:
- Start crawling from https://example.com.
- Search for "sample query" within the crawled documents.
- Retrieve up to 15 documents.
- Time out after 20 seconds if the process takes too long.
Download the binary directly from the Releases page, or compile it yourself:
- Clone the repository (make sure Rust is installed on your machine):
  `git clone <repository-url>`
- Build the project:
  `cargo build --release`
This project is licensed under the MIT License. See LICENSE for more details.
Contributions are welcome! Feel free to submit a pull request or open an issue.
Built with performance in mind:
- Tokio thread pools, with channels between them, for I/O-bound tasks.
- Rayon for CPU-bound tasks.
- Database: SurrealDB.
All of this supports scalability, maintainability, and isolation of the main process; a rough sketch of the pipeline is shown below.
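The sketch assumes only the crates named above (tokio, rayon); the channel layout and the exact split between async fetching and CPU-bound tokenization are illustrative, not taken from this project's implementation.

```rust
// Illustrative pipeline sketch, assuming only the crates named above
// (tokio, rayon). Names and the division of work are not taken from
// this project's source.
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Channel from the I/O-bound fetch side to the CPU-bound side.
    let (tx, mut rx) = mpsc::channel::<String>(32);

    // I/O-bound: fetch pages on the Tokio runtime and push raw bodies
    // into the channel. A real crawler would use an HTTP client here;
    // this placeholder just fabricates a "document" per URL.
    tokio::spawn(async move {
        for url in ["https://example.com"] {
            let body = format!("document fetched from {url}");
            if tx.send(body).await.is_err() {
                break; // receiver dropped, stop fetching
            }
        }
    });

    // CPU-bound: hand each document to Rayon for tokenization/scoring,
    // via spawn_blocking so the async runtime stays free for I/O.
    while let Some(doc) = rx.recv().await {
        let tokens = tokio::task::spawn_blocking(move || {
            use rayon::prelude::*;
            doc.split_whitespace()
                .collect::<Vec<_>>()
                .par_iter()
                .map(|t| t.to_lowercase())
                .collect::<Vec<String>>()
        })
        .await
        .expect("tokenization task panicked");
        println!("tokenized {} terms from one document", tokens.len());
    }
}
```

Keeping the CPU-heavy work inside `spawn_blocking`/Rayon prevents it from stalling the async runtime that drives the network I/O, which is the isolation the list above refers to.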
Happy crawling!