This project is a Python web scraper built with Scrapy that extracts raw job data from wuzzuf.com. It collects details such as job titles, experience requirements, and other key fields for later analysis or display.
- Pagination: Navigates and parses multiple result pages.
- Dynamic Content Handling: Extracts data from complex, deeply nested page structures.
- CSS and XPath Selectors: Targets specific elements within HTML to ensure accurate scraping.
- Export Formats: Saves data in JSON or CSV for further analysis.
- Error Handling: Detects and skips missing or malformed elements without crashing.
- Optimized for Scalability: Designed for scraping multiple pages or sites with minimal modifications.
- Python 3.7+
- pip: Python package installer
- Install dependencies:
pip install scrapy
- Clone the Repository:
git clone https://github.com/yourusername/your-repo.git
cd your-repo
- Run the Scraper:
scrapy crawl job_spider -o jobs.csv
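As an alternative to passing `-o` on every run, exports can be declared once in the project's `settings.py` through Scrapy's `FEEDS` setting (available from Scrapy 2.1). This is a sketch only; the file names below are assumptions, and this repository may not configure feeds this way.

```python
# settings.py (sketch): write both JSON and CSV on every crawl.
# File names are illustrative; adjust paths as needed.
FEEDS = {
    "jobs.json": {"format": "json", "encoding": "utf8"},
    "jobs.csv": {"format": "csv"},
}
```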
- Scrapy: For handling the scraping process.
- CSS Selectors and XPath: To target specific elements on the page.
- JSON/CSV: For data export.
- Handling duplicate classes: Solved using nth-child and sibling/child selectors.
- Working with JavaScript-rendered content: Adjusted scraping logic for better compatibility.
- Managing dynamic content: Focused on specific, nested elements to ensure data accuracy.
Tool | Strength | Weakness |
---|---|---|
Scrapy | High performance, asynchronous scraping | Steep learning curve |
BeautifulSoup | Easy to use for static pages | Limited for JavaScript content |
Selenium | Handles JavaScript and dynamic pages | Resource-intensive and slower |
[
{
"job_title": "Software Engineer",
"experience": "4 to 6 years",
"type": "Full-time",
"location": "Remote"
},
{
"job_title": "Data Scientist",
"experience": "2 to 4 years",
"type": "Part-time",
"location": "New York"
}
]
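Output in this shape can be post-processed with the standard library alone. The sketch below inlines the two sample records above so that it stays self-contained; in practice the records would be loaded from the exported file (for example with `json.load`), and the file name would depend on the `-o` argument used.

```python
import csv
import io
import json

# The sample records from this README, inlined for a self-contained example.
raw = """
[
  {"job_title": "Software Engineer", "experience": "4 to 6 years",
   "type": "Full-time", "location": "Remote"},
  {"job_title": "Data Scientist", "experience": "2 to 4 years",
   "type": "Part-time", "location": "New York"}
]
"""
records = json.loads(raw)

# Simple analysis step: keep the titles of full-time jobs only.
full_time = [r["job_title"] for r in records if r["type"] == "Full-time"]

# Convert the same records to CSV text in memory.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["job_title", "experience", "type", "location"]
)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```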
- Add proxy rotation to reduce the risk of IP bans.
- Integrate Selenium for JavaScript-heavy pages.
- Implement AI/ML to categorize and analyze scraped job data.
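Proxy rotation is not implemented yet; one possible shape for it is a small downloader middleware that cycles through a list of proxies. The class name `RotatingProxyMiddleware` and the `PROXY_LIST` setting are hypothetical names chosen for this sketch, not Scrapy built-ins. The only Scrapy behavior relied on is that the `proxy` key in `request.meta` selects the proxy for a request.

```python
from itertools import cycle


class RotatingProxyMiddleware:
    """Downloader-middleware sketch: assign the next proxy to each request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        # cycle() loops over the list forever in round-robin order.
        self._proxies = cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting holding proxy URLs.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider=None):
        # Scrapy routes the request through the proxy named in meta["proxy"].
        request.meta["proxy"] = next(self._proxies)
        return None  # let the request continue through the middleware chain
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in `settings.py`, with `PROXY_LIST` defined alongside it. Round-robin is the simplest policy; a production setup would likely also drop proxies that fail repeatedly.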
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License. See the LICENSE file for details.
Created by Raed Sherif. Feel free to reach out for collaboration or questions!