# 🚀 Concurrent Web Crawler

## Overview

The Concurrent Web Crawler is a Go-based application that crawls web pages efficiently using Go's concurrency features: Goroutines, Channels, and synchronization primitives. The project was built to deepen understanding of Go's concurrency model while providing a fast, reliable tool for gathering and analyzing web content.

## Objectives

- **Learn Go's Concurrency Model**: Gain hands-on experience with Goroutines and Channels.
- **Implement Synchronization**: Understand and apply synchronization tools like Mutexes and WaitGroups.
- **Build a Scalable Application**: Create a web crawler that can efficiently handle multiple tasks concurrently.
- **Respect Web Crawling Ethics**: Implement features to respect `robots.txt` and avoid overloading servers.

## Features

- **Concurrent Crawling**: Utilizes Goroutines to fetch and parse web pages concurrently (see the sketch after this list).
- **Channel-Based Communication**: Uses channels for efficient data flow between workers.
- **URL Parsing and Normalization**: Extracts and normalizes links from HTML content.
- **Visited URL Tracking**: Keeps track of visited URLs to prevent duplicate processing.
- **Robots.txt Compliance**: Checks and respects `robots.txt` directives for each domain.
- **Rate Limiting**: Limits request rates to avoid overwhelming web servers.
- **Error Handling**: Robust mechanisms to manage failures during crawling.
- **Graceful Shutdown**: Handles system signals so the crawler can be stopped gracefully.
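
To make the concurrency and channel features concrete, here is a minimal sketch of the kind of Goroutine worker pool a crawler like this typically builds on. It is an illustration only, not the project's actual code: the worker count, the `fetchStatus` helper, and the hard-coded URLs are all assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// fetchStatus is a hypothetical stand-in for the project's fetch/parse logic;
// here it only reports the HTTP status of each URL.
func fetchStatus(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return resp.Status, nil
}

func main() {
	urls := make(chan string) // work queue shared by all workers
	var wg sync.WaitGroup

	const workers = 4 // assumed pool size
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls { // each worker drains the channel
				status, err := fetchStatus(u)
				if err != nil {
					fmt.Println("error:", u, err)
					continue
				}
				fmt.Println(u, status)
			}
		}()
	}

	for _, u := range []string{"https://example.com", "https://example.org"} {
		urls <- u
	}
	close(urls) // lets the workers exit once the queue is drained
	wg.Wait()
}
```

A real crawl loop would also feed newly discovered links back into the queue and stop at the configured depth.
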
## Getting Started

### Prerequisites

Before you start, ensure you have Go installed on your system. You can download it from the [official Go website](https://golang.org/dl/).

### Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/LOKESH-loky/Concurrent-Web-Crawler.git
   cd Concurrent-Web-Crawler
   ```

2. Download the dependencies:
   ```bash
   go mod download
   ```

3. Build the application:
   ```bash
   go build -o webcrawler
   ```

4. Run the application:
   ```bash
   ./webcrawler
   ```

### Configuration

The application supports various configuration options. You can adjust the following parameters in the `config.yaml` file:

- `maxDepth`: Set the maximum depth for crawling.
- `maxUrls`: Limit the number of URLs to visit.
- `rateLimit`: Control the number of requests per second.

Example configuration:
```yaml
maxDepth: 3
maxUrls: 100
rateLimit: 10
```
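
How the crawler reads `config.yaml` is internal to the project; the snippet below is a rough sketch of how these three keys could be loaded and how the rate limit could be turned into a request interval. The `gopkg.in/yaml.v3` dependency, the `Config` struct, and the file path are assumptions, not the project's actual code.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"gopkg.in/yaml.v3" // assumed YAML library; the project may load config differently
)

// Config mirrors the keys shown in the example config.yaml above.
type Config struct {
	MaxDepth  int `yaml:"maxDepth"`
	MaxUrls   int `yaml:"maxUrls"`
	RateLimit int `yaml:"rateLimit"` // requests per second
}

func loadConfig(path string) (Config, error) {
	var cfg Config
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, err
	}
	err = yaml.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	cfg, err := loadConfig("config.yaml")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	if cfg.RateLimit <= 0 {
		cfg.RateLimit = 1 // avoid dividing by zero below
	}
	// One simple way to honor rateLimit: allow one request per tick interval.
	interval := time.Second / time.Duration(cfg.RateLimit)
	fmt.Printf("up to %d URLs, depth %d, one request every %v\n",
		cfg.MaxUrls, cfg.MaxDepth, interval)
}
```
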

### Project Structure

```
concurrent-web-crawler/
├── fetch.go
├── go.mod
├── go.sum
├── main.go
├── parse.go
├── README.md
└── robots.go
```

### Running the Crawler

To start crawling, run the command:
```bash
./webcrawler -url <start_url>
```
Replace `<start_url>` with the target URL you want to crawl. During development, you can also run the crawler directly with `go run .` instead of building a binary.

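
A flag like `-url` is typically wired up with Go's standard `flag` package. The snippet below is a hedged sketch of that plumbing, not the project's exact `main.go`:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// -url is the entry point for the crawl; it defaults to empty so we can
	// require it explicitly.
	startURL := flag.String("url", "", "URL to start crawling from")
	flag.Parse()

	if *startURL == "" {
		fmt.Fprintln(os.Stderr, "usage: webcrawler -url <start_url>")
		os.Exit(1)
	}
	fmt.Println("starting crawl at", *startURL)
}
```
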

### Output

The crawler outputs the results in a structured format. You can specify the output format using command-line flags:

- `-json`: Outputs in JSON format.
- `-csv`: Outputs in CSV format.

### Example

```bash
./webcrawler -url https://example.com -json
```
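
The exact record layout is up to the crawler; as an illustration of how JSON output can be produced with the standard library, here is a sketch using a hypothetical `PageResult` type (the fields are assumptions). CSV output works the same way with `encoding/csv`, writing one row per page.

```go
package main

import (
	"encoding/json"
	"os"
)

// PageResult is a hypothetical shape for one crawled page; the real crawler
// may record different fields.
type PageResult struct {
	URL    string   `json:"url"`
	Status int      `json:"status"`
	Links  []string `json:"links"`
}

func main() {
	results := []PageResult{
		{URL: "https://example.com", Status: 200, Links: []string{"https://example.com/about"}},
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ") // pretty-print the result set
	if err := enc.Encode(results); err != nil {
		panic(err)
	}
}
```
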

## Advanced Usage

### Concurrency Control

The crawler allows you to control the number of concurrent requests through the `-concurrency` flag:
```bash
./webcrawler -url https://example.com -concurrency 5
```
Adjust this number based on the target server's capabilities and your needs.
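
Internally, a concurrency cap like this is often enforced with a buffered channel used as a counting semaphore. The sketch below shows that pattern with an assumed limit of 5 and made-up URLs; it is illustrative rather than the crawler's actual implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	urls := []string{"https://example.com", "https://example.org", "https://example.net"}

	concurrency := 5                        // value that would come from -concurrency
	sem := make(chan struct{}, concurrency) // counting semaphore: at most N requests in flight
	var wg sync.WaitGroup

	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks here once N requests are already running
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			resp, err := http.Get(u)
			if err != nil {
				fmt.Println("error:", u, err)
				return
			}
			resp.Body.Close()
			fmt.Println(u, resp.Status)
		}(u)
	}
	wg.Wait()
}
```
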

### Custom User Agent

To reduce the chance of being blocked, set a custom User-Agent in `config.yaml`:
```yaml
userAgent: "MyCustomCrawler/1.0"
```
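
Applying the configured agent string usually comes down to setting the `User-Agent` header on each outgoing request. A minimal sketch, assuming the value has already been read from `config.yaml`:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	userAgent := "MyCustomCrawler/1.0" // value read from config.yaml

	req, err := http.NewRequest(http.MethodGet, "https://example.com", nil)
	if err != nil {
		panic(err)
	}
	// http.Get would send Go's default agent; setting the header explicitly
	// lets servers identify the crawler.
	req.Header.Set("User-Agent", userAgent)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```
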

## Error Handling

The crawler includes built-in error handling. It will log errors and continue processing the remaining URLs. You can find logs in the `logs` directory.
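
The crawler's own error handling lives in its source; the sketch below shows one common pattern it resembles: retry a failed request a few times with a simple backoff and log each failure to a file under `logs/`. The log file name and retry count are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

// fetchWithRetry tries a URL a few times before giving up, logging each failure.
func fetchWithRetry(logger *log.Logger, url string, attempts int) error {
	var err error
	for i := 1; i <= attempts; i++ {
		var resp *http.Response
		resp, err = http.Get(url)
		if err == nil {
			resp.Body.Close()
			return nil
		}
		logger.Printf("attempt %d/%d failed for %s: %v", i, attempts, url, err)
		time.Sleep(time.Duration(i) * time.Second) // simple linear backoff
	}
	return fmt.Errorf("giving up on %s: %w", url, err)
}

func main() {
	// Assumed log location; adjust to wherever the crawler writes its logs.
	if err := os.MkdirAll("logs", 0o755); err != nil {
		log.Fatal(err)
	}
	f, err := os.OpenFile("logs/crawler.log", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	logger := log.New(f, "crawler ", log.LstdFlags)

	if err := fetchWithRetry(logger, "https://example.com", 3); err != nil {
		logger.Print(err)
	}
}
```
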

## Testing

To run tests, use the following command:
```bash
go test ./...
```
Make sure to review and run tests before deploying.

## Contributing

Contributions are welcome! Here’s how you can help:

1. Fork the repository.
2. Create a new branch:
   ```bash
   git checkout -b feature/YourFeature
   ```
3. Make your changes.
4. Commit your changes:
   ```bash
   git commit -m "Add new feature"
   ```
5. Push to the branch:
   ```bash
   git push origin feature/YourFeature
   ```
6. Create a pull request.

Please ensure your code follows the existing style and includes tests where appropriate.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Releases

For the latest versions and updates, please visit the [Releases](https://github.com/LOKESH-loky/Concurrent-Web-Crawler/releases) section.

## What I Learned

### Go's Concurrency Model

- **Goroutines**: Learned how to launch lightweight threads using Goroutines, allowing functions to run concurrently.
- **Channels**: Understood how to use Channels for communication between Goroutines, enabling safe data transfer without explicit locking.

### Synchronization Primitives

- **WaitGroups**: Used `sync.WaitGroup` to wait for a collection of Goroutines to finish executing.
- **Mutexes**: Applied `sync.Mutex` to protect shared resources and prevent race conditions (see the sketch below).
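
As a concrete example of these two primitives working together, here is a small sketch of the visited-URL bookkeeping described above: a mutex-guarded set shared by several Goroutines, with a `WaitGroup` making `main` wait for all of them. The type and function names are illustrative, not the project's.

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet records which URLs have been processed; the mutex keeps the map
// safe when many goroutines check and update it at once.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

// Visit returns true the first time a URL is seen, false on duplicates.
func (v *visitedSet) Visit(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

func main() {
	visited := &visitedSet{seen: make(map[string]bool)}
	var wg sync.WaitGroup

	urls := []string{"https://example.com", "https://example.com", "https://example.org"}
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			if visited.Visit(u) {
				fmt.Println("crawling", u)
			} else {
				fmt.Println("skipping duplicate", u)
			}
		}(u)
	}
	wg.Wait() // block until every goroutine has finished
}
```
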

### Web Crawling Techniques

- **HTTP Requests**: Gained experience with the `net/http` package to perform HTTP requests and handle responses.
- **HTML Parsing**: Utilized the `goquery` library to parse HTML documents and extract links efficiently.
- **URL Normalization**: Learned to normalize and resolve relative URLs to absolute URLs, ensuring accurate crawling paths (see the sketch below).
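
URL normalization in Go is mostly a matter of `net/url`'s `ResolveReference`. The sketch below resolves a few raw hrefs against a base page and strips fragments; the example URLs are made up and the real `parse.go` may handle more cases.

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// base is the page the links were found on.
	base, err := url.Parse("https://example.com/blog/post.html")
	if err != nil {
		panic(err)
	}

	// Raw hrefs as they might appear in the HTML.
	hrefs := []string{"/about", "../archive/", "https://example.org/x", "#section"}

	for _, h := range hrefs {
		ref, err := url.Parse(h)
		if err != nil {
			continue // skip malformed links
		}
		abs := base.ResolveReference(ref)
		abs.Fragment = "" // drop #fragments so the same page isn't visited twice
		fmt.Println(h, "->", abs.String())
	}
}
```
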

### Ethical Crawling Practices

- **Robots.txt Compliance**: Implemented functionality to read and respect `robots.txt` files, adhering to website crawling policies (see the sketch below).
- **Rate Limiting**: Introduced rate limiting to control the frequency of HTTP requests, preventing server overloads.
- **Domain Restrictions**: Limited crawling to specific domains to avoid unintended crawling of external sites.
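
The project's `robots.go` handles this for real; purely as an illustration of the idea, here is a deliberately naive check that fetches `/robots.txt` and honors `Disallow` rules for the wildcard user agent. A full parser handles much more, such as `Allow` rules and per-agent sections.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// disallowedPaths does a naive read of robots.txt: it collects the Disallow
// rules that apply to all agents ("*").
func disallowedPaths(host string) ([]string, error) {
	resp, err := http.Get(host + "/robots.txt")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var rules []string
	applies := false
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case strings.HasPrefix(line, "User-agent:"):
			applies = strings.TrimSpace(strings.TrimPrefix(line, "User-agent:")) == "*"
		case applies && strings.HasPrefix(line, "Disallow:"):
			rules = append(rules, strings.TrimSpace(strings.TrimPrefix(line, "Disallow:")))
		}
	}
	return rules, scanner.Err()
}

func main() {
	rules, err := disallowedPaths("https://example.com")
	if err != nil {
		fmt.Println("could not read robots.txt:", err)
		return
	}
	path := "/private/page.html"
	for _, r := range rules {
		if r != "" && strings.HasPrefix(path, r) {
			fmt.Println("skipping", path, "(disallowed by robots.txt)")
			return
		}
	}
	fmt.Println("allowed to crawl", path)
}
```
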

### Error Handling and Logging

- **Robust Error Handling**: Developed strategies for retrying failed requests and handling various types of errors gracefully.
- **Logging**: Employed the `log` package to record significant events and errors, aiding in debugging and monitoring.

## Conclusion

This project provided a comprehensive exploration of Go's concurrency features and their practical application in building a real-world tool. Developing the Concurrent Web Crawler enhanced my technical skills and gave me valuable insights into software design, ethical crawling, and best practices in Go programming.

## Acknowledgments

- Thanks to the Go community for their contributions.
- Inspired by various open-source web crawling projects.

## Contact

For questions or suggestions, open an issue on GitHub or contact me directly through my profile.

---

This README provides a complete overview of the Concurrent Web Crawler. Feel free to explore, contribute, and use it for your web crawling needs!