This project is a web scraping and data analysis tool that extracts, processes, and visualizes labeled entities using several Python libraries. It includes:
- Web scraping that extracts the following labels, which can be changed as desired: [Person, Date, Place, City, Country].
- Insights in CSV files, including the number of occurrences of each entity. This works for a single web page or a full website.
- Word cloud generation from the CSV files, where word cloud images are created based on the occurrences.
- Network graph visualization that represents relationships between entities with two types of graphs: interactive and static.
- Support for both English and Arabic using language-specific processing techniques. For Arabic, entity extraction accuracy is lower due to limitations in GLiNER.
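The occurrence counts described above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the `entities` list, the function names, and the `entity_counts.csv` file name are all hypothetical, and the list stands in for whatever (text, label) pairs the entity recognizer returns.

```python
import csv
import os
import tempfile
from collections import Counter

# Hypothetical NER output: (entity text, label) pairs for one page.
entities = [
    ("Paris", "City"),
    ("Pablo Picasso", "Person"),
    ("Paris", "City"),
    ("Spain", "Country"),
    ("Pablo Picasso", "Person"),
    ("Paris", "City"),
]


def count_entities(pairs):
    """Count each (entity, label) pair, most frequent first."""
    return Counter(pairs).most_common()


def write_counts_csv(counts, path):
    """Write one row per entity with the columns entity, label, count."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["entity", "label", "count"])
        for (entity, label), count in counts:
            writer.writerow([entity, label, count])


counts = count_entities(entities)

# Written to the system temp directory so the sketch leaves the repo untouched.
csv_path = os.path.join(tempfile.gettempdir(), "entity_counts.csv")
write_counts_csv(counts, csv_path)
```

For a full website, the same counter could be updated once per crawled page before the CSV is written.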
✅ Web scraping using requests and BeautifulSoup
✅ URL processing with urllib.parse
✅ Data processing with pandas
✅ Word cloud visualization with wordcloud
✅ Arabic text reshaping with arabic-reshaper and python-bidi
✅ Network graph generation with networkx and d3.js
✅ Named entity recognition with GLiNER
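As an illustration of the URL processing step, here is a rough sketch of how `urllib.parse` can resolve the links found on a page and keep only those on the same site, which is what a full-website crawl typically needs. The function name `same_site_links` and the example URLs are assumptions for this sketch and are not taken from the project's code.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_site_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) links on the same host."""
    base_host = urlparse(base_url).netloc
    seen = set()
    links = []
    for href in hrefs:
        # Make the link absolute, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc != base_host:
            continue  # skip links that leave the site
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical href values as they might appear in <a> tags.
hrefs = [
    "/artists/1",
    "about.html#team",
    "https://other.example.org/x",
    "mailto:hi@example.com",
    "/artists/1",
]
links = same_site_links("https://example.com/gallery/index.html", hrefs)
```

Comparing `netloc` exactly treats `www.example.com` and `example.com` as different sites; a real crawler may want a looser rule.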
Ensure you have Python installed on your system. Install the required dependencies using:
pip install -r requirements.txt
Ensure you have CSV files with multiple artists/objects in the following format
python main.py
python -m http.server 8000
Filter and show relationships between artists based on connected persons:
Filter and show relationships between artists based on connected places:
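The idea behind both filters can be sketched in plain Python: two artists are linked when they share at least one connected entity. The data, the function name `shared_entity_edges`, and the `min_shared` parameter below are hypothetical and only illustrate the filtering logic, not the project's graph code. The same function works for places if the sets hold place names instead of persons.

```python
from itertools import combinations

# Hypothetical input: each artist mapped to the persons found on their pages.
connections = {
    "Artist A": {"Person X", "Person Y"},
    "Artist B": {"Person Y", "Person Z"},
    "Artist C": {"Person Q"},
    "Artist D": {"Person X", "Person Y"},
}


def shared_entity_edges(connections, min_shared=1):
    """Return (artist, artist, shared entities) for pairs sharing enough entities."""
    edges = []
    for a, b in combinations(sorted(connections), 2):
        shared = connections[a] & connections[b]
        if len(shared) >= min_shared:
            edges.append((a, b, sorted(shared)))
    return edges


edges = shared_entity_edges(connections)
strong_edges = shared_entity_edges(connections, min_shared=2)
```

Each tuple can then be added as an edge to a graph, with the list of shared entities kept as an edge attribute for display.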
├── main.py # Main
├── scrapper_v2.py # Web scraper and Data processing
├── finalCrawling.py # Web crawling
├── finalMapping_v2.py # Nationality and country detection and mapping
├── graphs.py # Graphs generator
├── finalWordCloud.py # Word cloud generator
├── requirements.txt # Dependencies
├── README.md # Project documentation
├── Amiri-Regular.ttf # Font for word cloud
├── DejaVuSans.ttf # Font for word cloud
├── countries_and_demonyms.csv # Important data for finalMapping_v2.py
├── index.html # Main HTML file for visualization
└── script.js # JavaScript file containing D3.js logic
Feel free to submit issues or pull requests for improvements! We welcome contributions to enhance this project.
This project was developed by interns and utilizes open-source libraries and existing tools. We acknowledge and appreciate the efforts of the open-source community in making these resources available.
Leen Koree and AlDanah AlAnazi
🔗 Happy Coding! 🚀