
Prothom Alo News Crawler 📰


Overview

The Prothom Alo News Crawler is a Python-based web scraping tool designed to fetch and parse news articles from the Prothom Alo website. It uses Selenium for dynamic content loading, BeautifulSoup for HTML parsing, and Elasticsearch for storing the scraped data.

Features

  • Dynamic Content Loading: Utilizes Selenium to handle JavaScript-rendered content.
  • HTML Parsing: Uses BeautifulSoup to parse and extract data from HTML.
  • Elasticsearch Integration: Stores the scraped articles in an Elasticsearch database.
  • Error Handling: Robust error handling and logging mechanisms.

Requirements

  • Python 3.x
  • Docker (for running Elasticsearch)
  • The following Python packages (listed in requirements.txt; a sample file is shown after this list):
    • selenium
    • beautifulsoup4
    • requests
    • elasticsearch
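
A minimal requirements.txt matching this list might look like the following (no versions are pinned here; the repository's own file is authoritative):

selenium
beautifulsoup4
requests
elasticsearch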

Setup

1. Clone the Repository

git clone https://github.com/ai-naymul/protom_alo-newspaper-scraper.git
cd protom_alo-newspaper-scraper

2. Install Python Dependencies

pip install -r requirements.txt

3. Set Up Elasticsearch with Docker

Ensure Docker is installed on your system. Then start Elasticsearch using the Compose file included in the repository:

docker-compose -f docker_compose.yml up -d

This will start an Elasticsearch instance on localhost:9200.
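
For reference, a minimal single-node configuration equivalent to what such a Compose file typically contains is sketched below. The image version and disabled security are assumptions for local development; the repository's docker_compose.yml is authoritative:

version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4
    environment:
      - discovery.type=single-node      # no cluster formation for local use
      - xpack.security.enabled=false    # skip TLS/auth for local development
    ports:
      - "9200:9200"                     # expose the REST API on localhost:9200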

4. Run the Crawler

Execute the main script to start crawling:

python protom_alo_crawler.py
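
Once the script finishes, you can confirm that documents were indexed by querying Elasticsearch directly. The index name "articles" below is an assumption; check the crawler code for the actual name:

curl "http://localhost:9200/articles/_search?pretty"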

Key Files

  • protom_alo_crawler.py: Contains the ProthomAloCrawler class, which extends the NewsCrawler class to implement site-specific crawling logic.
  • generalized_news_crawler.py: Defines the NewsCrawler abstract base class, providing common functionality for web scraping and Elasticsearch integration (the class relationship is sketched after this list).
  • docker_compose.yml: Docker Compose configuration for setting up Elasticsearch.
  • requirements.txt: Lists the Python dependencies required for the project.
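
The following is a hypothetical sketch of how the two classes relate, based only on the descriptions above; method signatures and the index name are assumptions, not the repository's exact code:

from abc import ABC, abstractmethod

from elasticsearch import Elasticsearch


class NewsCrawler(ABC):
    """Shared scraping and Elasticsearch plumbing (generalized_news_crawler.py)."""

    def __init__(self, base_url, es_host="localhost", es_port=9200):
        self.base_url = base_url
        self.es = Elasticsearch(f"http://{es_host}:{es_port}")

    @abstractmethod
    def get_article_urls(self, date):
        """Return the URLs of articles published on the given date."""

    @abstractmethod
    def parse_article(self, url):
        """Return a dict of fields extracted from one article page."""

    def save_article(self, article, index="articles"):  # index name is an assumption
        # Index one parsed article document into Elasticsearch.
        self.es.index(index=index, document=article)


class ProthomAloCrawler(NewsCrawler):
    """Site-specific crawling logic (protom_alo_crawler.py)."""

    def get_article_urls(self, date):
        ...  # navigate the search page and collect article links

    def parse_article(self, url):
        ...  # extract headline, date, content, category, suggested articles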

How It Works

Crawling Articles

  1. Initialization: The ProthomAloCrawler class is initialized with the base URL and Elasticsearch host/port.
  2. Fetching URLs: The get_article_urls method fetches article URLs for a given date by navigating to the search page and clicking the "Load More" button until all articles are loaded.
  3. Parsing Articles: The parse_article method extracts details like headline, publication date, content, category, and suggested articles from each article page.
  4. Saving to Elasticsearch: Parsed articles are saved to Elasticsearch for easy querying and analysis (an end-to-end sketch of these steps follows).
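
The sketch below illustrates these four steps end to end. The search URL, CSS selectors, and index name are all assumptions for illustration; the repository's actual code may differ:

import time

from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

es = Elasticsearch("http://localhost:9200")
driver = webdriver.Chrome()

# Step 2: click "Load More" until it disappears so every result is rendered.
driver.get("https://www.prothomalo.com/search")  # hypothetical search page URL
while True:
    try:
        driver.find_element(By.CSS_SELECTOR, ".load-more-content").click()
        time.sleep(2)  # give the newly loaded articles time to render
    except NoSuchElementException:
        break

soup = BeautifulSoup(driver.page_source, "html.parser")
urls = [a["href"] for a in soup.select("a.article-link")]  # hypothetical selector

# Steps 3-4: parse each article page and index the extracted fields.
for url in urls:
    driver.get(url)
    page = BeautifulSoup(driver.page_source, "html.parser")
    article = {
        "headline": page.select_one("h1").get_text(strip=True),
        "content": " ".join(p.get_text(strip=True) for p in page.select("p")),
        "url": url,
    }
    es.index(index="articles", document=article)  # index name is an assumption

driver.quit()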

Contributing

Feel free to fork this repository and submit pull requests. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License.


Happy Crawling! 🚀
