A simple multithreaded web crawler that scrapes news article titles and content from Hamshahri Online and stores the results in a CSV file.
This Python project uses:

- `requests` and `BeautifulSoup` for HTTP requests and HTML parsing
- `concurrent.futures.ThreadPoolExecutor` for multithreaded crawling
- `pandas` to manage and export the scraped data
- `get_links()`:
  - Scrapes all anchor (`<a>`) tags from the homepage.
  - Filters for URLs containing the word "news".
  - Appends the root URL to relative paths.
- `get_info(url)`:
  - Fetches the HTML content of a given news article.
  - Extracts the title and visible text.
- `TasnimCrawler()`:
  - Calls `get_links()` to collect article URLs.
  - Uses multithreading to scrape each article in parallel.
  - Stores the data in a pandas DataFrame and exports it to `tasnim.csv` (see the sketch after this list).
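Based on the behavior described above, here is a minimal sketch of how the three functions could fit together. The parsing details, error handling, and `max_workers` value are assumptions, not the project's exact code; in the actual repository these functions are split across `get_links.py`, `get_info.py`, and `main.py`.

```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd


def get_links(url_root="https://www.hamshahrionline.ir/"):
    """Collect article URLs from the homepage, keeping links containing 'news'."""
    soup = BeautifulSoup(requests.get(url_root).text, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if "news" in href:
            # Append the root URL to relative paths.
            if href.startswith("/"):
                href = url_root.rstrip("/") + href
            links.append(href)
    return links


def get_info(url):
    """Fetch one article and extract its title and visible text."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    text = soup.get_text(separator=" ", strip=True)
    return {"url": url, "title": title, "text": text}


def TasnimCrawler():
    """Crawl all collected links in parallel and export the results to CSV."""
    links = get_links()
    # max_workers is an assumption; tune it to avoid overloading the server.
    with ThreadPoolExecutor(max_workers=8) as pool:
        rows = list(pool.map(get_info, links))
    pd.DataFrame(rows).to_csv("tasnim.csv", index=False)


if __name__ == "__main__":
    TasnimCrawler()
```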
```
├── get_links.py   # Contains get_links() function
├── get_info.py    # Contains get_info() function
├── main.py        # Contains TasnimCrawler() function and runs it
├── tasnim.csv     # Output CSV with news data
└── README.md      # Project documentation
```
- Install dependencies: `pip install requests beautifulsoup4 pandas`
- Run the crawler: `python main.py`
- Output: a CSV file named `tasnim.csv` will be created containing:
  - `url` – the article's URL
  - `title` – the page title
  - `text` – the extracted full text
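Assuming the column names listed above, the output can be inspected quickly with pandas:

```python
import pandas as pd

# Load the crawler's output; column names are taken from the list above.
df = pd.read_csv("tasnim.csv")
print(df[["url", "title"]].head())  # peek at the first few scraped articles
print(f"{len(df)} articles total")
```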
- The crawler is currently hardcoded to work with `https://www.hamshahrionline.ir/`. You can modify the `url_root` parameter in `get_links()` to target other websites (ensure they're structured similarly), as sketched below.
- Be respectful of websites' terms of service and do not overload their servers.
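For example, assuming `get_links()` exposes `url_root` as a keyword argument (as the note above describes), retargeting might look like this. The URL here is a placeholder, and the target site's article links must contain the word "news" to pass the filter:

```python
from get_links import get_links  # module layout per the project structure above

# Hypothetical example: point the crawler at a different (placeholder) site.
links = get_links(url_root="https://www.example.com/")
```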
This project is open-source and free to use.