📊 Web Scraping, Crawling, and Data Analysis for Museums

📝 Description

This project is a web scraping and data analysis tool that extracts, processes, and visualizes labeled entities using several Python libraries. It includes:

Web scraping that extracts the following labels, which can be changed as desired: [Person, Date, Place, City, Country].
Insights in CSV files, including the number of occurrences of each entity. This works for a single web page or a full website.
Word cloud generation from the CSV files, where word cloud images are created based on the occurrence counts.
Network graph visualization that represents relationships between entities, available as both interactive and static graphs.
Support for both English and Arabic, using language-specific processing techniques. For Arabic, entity extraction accuracy is lower due to limitations in GLiNER.
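The occurrence counting described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes entities arrive as dictionaries with `text` and `label` keys, and the function names, CSV column names, and sample entities are all hypothetical.

```python
import csv
import io
from collections import Counter


def count_entities(entities):
    """Count how often each (label, text) pair occurs."""
    counts = Counter()
    for ent in entities:
        # Collapse repeated whitespace so "Pablo  Picasso" matches "Pablo Picasso".
        text = " ".join(ent["text"].split())
        if text:
            counts[(ent["label"], text)] += 1
    return counts


def write_counts_csv(counts, fileobj):
    """Write the counts as CSV rows, most frequent first (column names are assumed)."""
    writer = csv.writer(fileobj)
    writer.writerow(["label", "entity", "occurrences"])
    for (label, text), n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([label, text, n])


# Hypothetical sample input standing in for the output of an entity extractor.
sample = [
    {"text": "Pablo Picasso", "label": "Person"},
    {"text": "Paris", "label": "City"},
    {"text": "Pablo  Picasso", "label": "Person"},
]

counts = count_entities(sample)
buffer = io.StringIO()
write_counts_csv(counts, buffer)
print(buffer.getvalue())
```

Writing to a file object rather than a path keeps the sketch easy to try in memory; passing an opened file (with `newline=""`) would produce a CSV on disk instead.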

🚀 Features

✅ Web scraping using requests and BeautifulSoup

✅ URL processing with urllib.parse

✅ Data processing with pandas

✅ Word cloud visualization with wordcloud

✅ Arabic text reshaping with arabic-reshaper and python-bidi

✅ Network graph generation with networkx and d3.js

✅ Named entity recognition with GLiNER
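As an illustration of the URL processing feature, the sketch below shows one way `urllib.parse` could be used to keep a crawl on a single website. It is an assumption about the general approach, not the logic in this repository; the function name `same_site_links` and the example URLs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_site_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) links on the same host."""
    base_host = urlparse(base_url).netloc.lower()
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parts.netloc.lower() != base_host:
            continue  # skip links that leave the site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical links as they might appear in a page's anchor tags.
links = same_site_links(
    "https://museum.example/collection/",
    [
        "item/1",
        "/about#team",
        "https://other.example/x",
        "mailto:info@museum.example",
        "item/1",
    ],
)
print(links)
```

Dropping fragments and deduplicating before fetching helps avoid visiting the same page more than once during a full-website crawl.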

🔧 Installation

Prerequisites for the scraping tool

Ensure you have Python installed on your system. Install the required dependencies using:

pip install -r requirements.txt

Prerequisites for the interactive visualization tool

Ensure you have CSV files containing multiple artists/objects in the following format:

📌 Usage for the scraping tool

1️⃣ Run main.py:

python main.py

2️⃣ Enter your preferred choices:

3️⃣ Utilize the outputs and visualizations (word clouds and static network graphs):

📌 Usage for the interactive visualization tool

1️⃣ Run a local server from the tool's folder on the desired port:

python -m http.server 8000

2️⃣ Upload the CSV file:

3️⃣ Utilize the filtering options and visualize the relationships:

Filter and show relationships between artists based on connected persons:

Filter and show relationships between artists based on connected places:
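The idea behind these filters, linking two artists when they share a connected person or place, can be sketched in Python as below. This is only an illustration under assumptions: the interactive tool itself is written in JavaScript with D3.js, and the column names (`artist`, `person`, `place`), the function name, and the sample rows are hypothetical rather than the tool's actual CSV format.

```python
from collections import defaultdict
from itertools import combinations


def shared_entity_edges(rows, entity_key):
    """Link every pair of artists that share a value in the entity_key column."""
    artists_by_entity = defaultdict(set)
    for row in rows:
        artists_by_entity[row[entity_key]].add(row["artist"])

    edges = defaultdict(set)
    for entity, artists in artists_by_entity.items():
        # Sorting gives each pair a stable (a, b) order, so edges are undirected.
        for a, b in combinations(sorted(artists), 2):
            edges[(a, b)].add(entity)
    return dict(edges)


# Hypothetical rows, as they might be read with csv.DictReader.
rows = [
    {"artist": "Artist A", "person": "Person X", "place": "Cairo"},
    {"artist": "Artist B", "person": "Person X", "place": "Riyadh"},
    {"artist": "Artist C", "person": "Person Y", "place": "Cairo"},
]

person_edges = shared_entity_edges(rows, "person")
place_edges = shared_entity_edges(rows, "place")
print(person_edges)
print(place_edges)
```

Each edge keeps the set of shared entities, which could be used to label a link or weight it by how many connections two artists have in common.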

📂 File Structure for the scraping tool

├── main.py                      # Main entry point
├── scrapper_v2.py               # Web scraper and Data processing
├── finalCrawling.py             # Web crawling
├── finalMapping_v2.py           # Nationality and country detection and mapping
├── graphs.py                    # Graphs generator
├── finalWordCloud.py            # Word cloud generator
├── requirements.txt             # Dependencies
├── README.md                    # Project documentation
├── Amiri-Regular.ttf            # Font for word cloud
├── DejaVuSans.ttf               # Font for word cloud
├── countries_and_demonyms.csv   # Important data for finalMapping_v2.py

📂 File Structure for the interactive visualization tool

├── index.html                  # Main HTML file for visualization
├── script.js                   # JavaScript file containing D3.js logic

🤝 Contributing

Feel free to submit issues or pull requests for improvements! We welcome contributions to enhance this project.

📢 Declaration

This project was developed by interns and utilizes open-source libraries and existing tools. We acknowledge and appreciate the efforts of the open-source community in making these resources available.

Contributors:

Leen Koree and AlDanah AlAnazi

🔗 Happy Coding! 🚀


toaziz/web-scrapping-tool
