A comprehensive Python tool for scraping HTML tables, analyzing user data, and generating detailed reports with visualizations.
- Clone the Repository
  git clone https://github.com/pkscripts/web_scraper_bs4
  cd web_scraper_bs4
- Install Dependencies
  pip install -r requirements.txt
web_scraper_bs4/
│
├── web_scraper_project.py # Main script
├── index.html # Input HTML file
├── requirements.txt # Project dependencies
├── README.md # Documentation
│
└── reports/ # Generated reports directory
    ├── user_data_20240312_143022.csv
    ├── report_20240312_143022.json
    └── country_distribution_20240312_143022.png
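The file names in the reports directory share one timestamp of the form YYYYMMDD_HHMMSS. A minimal sketch of how such names could be built is shown here; the function name `report_paths` and its parameters are hypothetical and are not taken from web_scraper_project.py.

```python
from datetime import datetime
from pathlib import Path


def report_paths(reports_dir="reports", now=None):
    """Build the three report paths for one run, all sharing one timestamp.

    `report_paths` is a hypothetical helper; the real script may differ.
    """
    # Example: 12 March 2024, 14:30:22 becomes "20240312_143022"
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    base = Path(reports_dir)
    return {
        "csv": base / f"user_data_{stamp}.csv",
        "json": base / f"report_{stamp}.json",
        "chart": base / f"country_distribution_{stamp}.png",
    }


if __name__ == "__main__":
    for kind, path in report_paths().items():
        print(kind, path)
```

Computing the timestamp once per run keeps the CSV, JSON, and chart files from the same run grouped under the same suffix.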
- Data Extraction
  - User ID and basic information
  - Contact details (email)
  - Geographic data (country)
  - Account status and join dates
  - Profile URLs and avatar images
- Report Generation
  - CSV exports of raw data
  - JSON formatted analysis
  - Country-wise distribution charts
  - Statistical summaries
- Visualizations
  - Bar charts for user distribution
  - Country-wise analysis graphs
  - Automated chart generation
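To illustrate the data extraction feature, the sketch below collects the cell text from table rows marked with class="user-row", the marker this README asks for in the input file. It is only an approximation under stated assumptions: it uses Python's standard-library `html.parser` rather than BeautifulSoup (which the project lists as a dependency) so that it runs without extra packages, the names `UserRowParser`, `parse_users`, and `FIELDS` are hypothetical, and the column order is assumed. It reads plain cell text only and does not handle profile URLs or avatar images.

```python
from html.parser import HTMLParser

# Assumed column order; the real index.html may arrange cells differently.
FIELDS = ["id", "name", "email", "country", "join_date", "status"]


class UserRowParser(HTMLParser):
    """Collect the <td> text of every <tr class="user-row"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per user row
        self._in_row = False    # currently inside a user row
        self._in_cell = False   # currently inside a <td> of a user row
        self._cell = []         # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "user-row" in classes:
            self._in_row = True
            self.rows.append([])
        elif tag == "td" and self._in_row:
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.rows[-1].append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr":
            self._in_row = False

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)


def parse_users(html):
    """Return one dict per user row, keyed by the assumed FIELDS order."""
    parser = UserRowParser()
    parser.feed(html)
    parser.close()
    return [dict(zip(FIELDS, row)) for row in parser.rows]
```

Rows without the user-row class, such as a header row, are skipped, so only user records reach the later reporting steps.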
- Prepare Input File
  - Place your HTML file named index.html in the project directory
  - Ensure it contains user data in table format with class="user-row"
- Run the Script
  python web_scraper_project.py
- Console Output
  Country-wise User Report:
  +---------------+------------------+-------------------------+
  | Country       | Number of Users  | Users                   |
  +---------------+------------------+-------------------------+
  | United States | 25               | John Doe, Jane Smith... |
  | Canada        | 15               | Mike Ross, Rachel Z...  |
  | UK            | 10               | James Bond, Emma W...   |
  +---------------+------------------+-------------------------+

  Detailed Statistics:
  Total number of countries: 12
  Total number of users: 150
  Active users: 142
  Average users per country: 12.50

  Top 5 countries by number of users:
  United States: 25 users
  Canada: 15 users
  UK: 10 users
  Germany: 8 users
  France: 7 users

  Reports have been generated and saved in the 'reports' directory.
- Generated Files
  a. CSV Report (user_data_[timestamp].csv):
     id,name,email,country,join_date,status
     001,John Doe,john@example.com,United States,2024-01-15,active
     002,Jane Smith,jane@example.com,Canada,2024-01-16,active
     ...
  b. JSON Report (report_[timestamp].json):
     {
       "country_wise": [
         {
           "Country": "United States",
           "Number of Users": 25,
           "Users": "John Doe, Jane Smith..."
         }
       ],
       "statistics": {
         "total_countries": 12,
         "total_users": 150,
         "active_users": 142,
         "average_users_per_country": 12.5
       }
     }
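The JSON report above groups users by country and adds summary statistics. One way this structure could be produced from a list of user records is sketched below. The function name `build_report` is hypothetical, the sketch uses only the standard library rather than pandas, and the rule that a user counts as active when `status` equals "active" is an assumption based on the CSV sample.

```python
from collections import defaultdict


def build_report(users):
    """Group users by country and compute the summary statistics.

    `users` is a list of dicts with at least "name", "country" and "status".
    Hypothetical helper; the actual script may compute this differently.
    """
    by_country = defaultdict(list)
    for user in users:
        by_country[user["country"]].append(user["name"])

    # Largest countries first, matching the order of the console report
    ranked = sorted(by_country.items(), key=lambda kv: len(kv[1]), reverse=True)
    country_wise = [
        {"Country": country, "Number of Users": len(names), "Users": ", ".join(names)}
        for country, names in ranked
    ]

    total_users = len(users)
    total_countries = len(by_country)
    statistics = {
        "total_countries": total_countries,
        "total_users": total_users,
        # Assumption: "active" is the status value that marks an active user
        "active_users": sum(1 for u in users if u.get("status") == "active"),
        "average_users_per_country": (
            round(total_users / total_countries, 2) if total_countries else 0.0
        ),
    }
    return {"country_wise": country_wise, "statistics": statistics}
```

With the figures shown in the sample report, 150 users across 12 countries gives 150 / 12 = 12.5 average users per country. The returned dict can be serialized with `json.dumps(report, indent=2)`.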
- Python 3.6+
- beautifulsoup4>=4.9.3
- tabulate>=0.8.9
- pandas>=1.3.0
- matplotlib>=3.4.3
The script includes error handling for common issues:
- Missing input file
- Invalid HTML structure
- File permission errors
- Data extraction failures
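As a rough illustration of how the cases above might be caught, the sketch below maps standard Python exceptions to short error messages. The helper names `read_input` and `ensure_reports_dir` are hypothetical, and treating a decoding failure as a parse error is an assumption; the script's actual handling may differ.

```python
from pathlib import Path


def read_input(path="index.html"):
    """Return (html, error); exactly one of the two is None."""
    try:
        return Path(path).read_text(encoding="utf-8"), None
    except FileNotFoundError:
        return None, f"Error: Could not find {path}"
    except PermissionError:
        return None, f"Error: No permission to read {path}"
    except UnicodeDecodeError:
        # Assumption: an undecodable file is reported as a parse problem
        return None, f"Error: Unable to parse HTML content in {path}"


def ensure_reports_dir(path="reports"):
    """Create the reports directory; return an error message or None."""
    try:
        Path(path).mkdir(parents=True, exist_ok=True)
        return None
    except PermissionError:
        return "Error: Cannot create reports directory"


if __name__ == "__main__":
    html, error = read_input()
    error = error or ensure_reports_dir()
    if error:
        raise SystemExit(error)
```

Returning a message instead of raising lets the caller decide whether to exit or continue with partial results.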
- File Not Found Error
  Error: Could not find index.html
  Solution: Ensure index.html exists in the script directory
- Parse Error
  Error: Unable to parse HTML content
  Solution: Verify HTML file format and encoding
- Permission Error
  Error: Cannot create reports directory
  Solution: Check write permissions in the current directory
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Your Name
- GitHub: @Pandiyaraj