🕷️ Tasnim Crawler

A simple multithreaded web crawler that scrapes news article titles and content from Hamshahri Online and stores the results in a CSV file.

📌 Overview

This Python project uses:

- requests and BeautifulSoup for HTTP requests and HTML parsing
- concurrent.futures.ThreadPoolExecutor for multithreaded crawling
- pandas to manage and export the scraped data

🧠 How It Works

get_links():
- Scrapes all anchor (<a>) tags from the homepage.
- Filters for URLs containing the word "news".
- Prepends the root URL to relative paths so every link is absolute.
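
The link-collection step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name extract_news_links, the timeout value, and the de-duplication step are assumptions made for the example.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_news_links(html, url_root):
    """Return absolute URLs of all <a> tags whose href contains "news".

    Hypothetical helper: parsing is kept separate from the HTTP call so it
    can be exercised without network access.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if "news" not in href:
            continue
        # urljoin puts url_root in front of relative paths and leaves
        # already-absolute URLs unchanged.
        absolute = urljoin(url_root, href)
        if absolute not in links:  # assumption: skip duplicates, keep page order
            links.append(absolute)
    return links


def get_links(url_root="https://www.hamshahrionline.ir/"):
    """Fetch the homepage and return the filtered article links."""
    response = requests.get(url_root, timeout=10)  # timeout is an assumed value
    response.raise_for_status()
    return extract_news_links(response.text, url_root)
```

Using urljoin rather than plain string concatenation avoids doubled slashes when the root URL ends with "/" and the relative path starts with one.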
get_info(url):
- Fetches the HTML content of a given news article.
- Extracts the title and visible text.
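
One way the extraction step might look is shown below. It is a sketch under assumptions: the helper parse_article, the choice to drop script/style/noscript elements, and the dictionary return shape (mirroring the CSV columns) are illustrative and may differ from the real get_info().

```python
import requests
from bs4 import BeautifulSoup


def parse_article(html):
    """Return (title, visible_text) for one article page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title is not None else ""
    # Remove elements whose contents are never rendered as page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Prefer <body> so the <title> text is not repeated in the article text.
    root = soup.body if soup.body is not None else soup
    text = root.get_text(separator=" ", strip=True)
    return title, text


def get_info(url):
    """Fetch one article and return a row with url, title and text."""
    response = requests.get(url, timeout=10)  # timeout is an assumed value
    response.raise_for_status()
    title, text = parse_article(response.text)
    return {"url": url, "title": title, "text": text}
```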
TasnimCrawler():
- Calls get_links() to collect article URLs.
- Uses multithreading to scrape each article in parallel.
- Stores the data in a Pandas DataFrame and exports it to tasnim.csv.
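
The parallel step described above could be sketched like this. The function name crawl, its fetch parameter, and the max_workers value are assumptions for illustration; fetch stands in for any callable that, like get_info(url), returns a row with url, title and text.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def crawl(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL in a thread pool and collect the rows.

    Hypothetical sketch: fetch must return a dict with the keys
    "url", "title" and "text".
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map runs the calls concurrently but yields the results
        # in the same order as the input URLs.
        rows = list(executor.map(fetch, urls))
    return pd.DataFrame(rows, columns=["url", "title", "text"])


# Possible usage with the project's functions:
#   crawl(get_links(), get_info).to_csv("tasnim.csv", index=False)
```

Threads suit this workload because each task spends most of its time waiting on the network. Note that in this sketch a single failing request raises out of executor.map and stops the whole run; wrapping fetch in a try/except would be one way to skip bad pages instead.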

📁 File Structure

.
├── get_links.py       # Contains get_links() function
├── get_info.py        # Contains get_info() function
├── main.py            # Contains TasnimCrawler() function and runs it
├── tasnim.csv         # Output CSV with news data
└── README.md          # Project documentation

🚀 Usage

Install dependencies:

pip install requests beautifulsoup4 pandas

Run the crawler:

python main.py

Output:
- A CSV file named tasnim.csv will be created containing:
  - url – the article's URL
  - title – the page title
  - text – the extracted full text
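
To inspect the output afterwards, something like the following could be used. It is a small sketch: load_articles is a hypothetical helper, and the column check simply mirrors the three columns listed above.

```python
import pandas as pd

EXPECTED_COLUMNS = ["url", "title", "text"]


def load_articles(path="tasnim.csv"):
    """Load the crawler output and check that the expected columns exist."""
    df = pd.read_csv(path)
    missing = [name for name in EXPECTED_COLUMNS if name not in df.columns]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    return df


# Possible usage after a crawl:
#   articles = load_articles()
#   print(articles[["url", "title"]].head())
```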

⚠️ Notes

- The crawler is currently hardcoded to work with https://www.hamshahrionline.ir/. To target another website, change the url_root parameter of get_links(); this only works if the site's link structure is similar.
- Respect each website's terms of service and do not overload its servers.
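
One possible way to keep the load on a site low is sketched below. None of this is part of the project as described: the helper names, the one-second delay, and the sequential loop are assumptions, and the robots.txt check uses only the standard library's urllib.robotparser.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, rules, user_agent="*"):
    """Keep only the URLs that the rules allow for this user agent."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) one URL at a time, pausing between requests.

    Assumption: a fixed pause is enough; it trades the speed of the
    thread pool for a gentler request rate.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the thread pool, a similar effect could be approximated by lowering the number of workers.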

📄 License

This project is open-source and free to use.
