Web scraping tool to extract information from museum encyclopedia and collection websites.

toaziz/web-scrapping-tool

 
 


📊 Web Scraping, Crawling, and Data Analysis for Museums

📝 Description

This project is a web scraping and data analysis tool that extracts, processes, and visualizes labeled entities using several Python libraries. It includes:

  • Web scraping that extracts entities with the following labels, which can be changed as desired: [Person, Date, Place, City, Country].
  • Insights in CSV files, including the number of occurrences of each entity, for either a single web page or a full website.
  • Word cloud generation from the CSV files, where word cloud images are created based on the occurrence counts.
  • Network graph visualization of relationships between entities, as both interactive and static graphs.
  • Support for both English and Arabic using language-specific processing techniques. For Arabic, entity extraction is less accurate due to limitations in GLiNER.
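
The occurrence counting described above can be sketched with the standard library alone. This is a minimal illustration, not the project's own code: the function names, the CSV column names, and the sample entities are all assumptions, and the actual tool uses pandas for its data processing.

```python
import csv
import io
from collections import Counter


def count_entities(entities):
    """Count (text, label) pairs, ignoring case and surrounding whitespace."""
    counts = Counter()
    for text, label in entities:
        counts[(text.strip().lower(), label)] += 1
    return counts


def write_counts_csv(counts, handle):
    """Write one row per entity, most frequent first."""
    writer = csv.writer(handle)
    # Column names are hypothetical; the project's CSV layout may differ.
    writer.writerow(["entity", "label", "occurrences"])
    for (text, label), n in counts.most_common():
        writer.writerow([text, label, n])


# Hypothetical entities extracted from a single page.
sample = [("Cairo", "City"), ("cairo ", "City"), ("Egypt", "Country")]
counts = count_entities(sample)

buffer = io.StringIO()
write_counts_csv(counts, buffer)
print(buffer.getvalue())
```

Normalizing case and whitespace before counting keeps "Cairo" and "cairo " from being reported as separate entities. Whether the project normalizes this way is not stated in this README.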

🚀 Features

✅ Web scraping using requests and BeautifulSoup

✅ URL processing with urllib.parse

✅ Data processing with pandas

✅ Word cloud visualization with wordcloud

✅ Arabic text reshaping with arabic-reshaper and python-bidi

✅ Network graph generation with networkx and d3.js

✅ Named entity recognition with GLiNER
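
As an illustration of the URL processing step, the sketch below uses `urllib.parse` to resolve links found on a page and keep only those on the same host, which is a common way to stop a crawler from leaving the target website. The function name `same_site_links` and the example URLs are assumptions for illustration; the project's crawler may apply different rules.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_site_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep unique links on the same host."""
    host = urlparse(page_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc != host:
            continue  # skip links that leave the site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical page URL and href values.
links = same_site_links(
    "https://museum.example/collection/index.html",
    [
        "item/1",
        "/about#team",
        "https://other.example/x",
        "mailto:info@museum.example",
        "item/1",
    ],
)
print(links)
```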


🔧 Installation

Prerequisites for the scraping tool

Ensure you have Python installed on your system. Install the required dependencies using:

pip install -r requirements.txt

Prerequisites for the interactive visualization tool

Ensure you have CSV files that contain multiple artists/objects and follow the format below:

image


📌 Usage for the scraping tool

1️⃣ Run main.py:

python main.py

2️⃣ Enter your preferred choices:

image

3️⃣ Utilize the outputs and visualizations (word clouds and static network graphs):

image

image

📌 Usage for the interactive visualization tool

1️⃣ Start a local server from the folder that contains index.html, using the desired port:

python -m http.server 8000

2️⃣ Upload the CSV file:

image

3️⃣ Utilize the filtering options and visualize the relationships:

Filter and show relationships between artists based on connected persons: image

Filter and show relationships between artists based on connected places: image
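
The idea behind these filters can be sketched in Python: two artists are related when they are connected to the same person or place. This is only a conceptual sketch, since the interactive tool implements its logic in JavaScript with D3.js; the function name and the (artist, entity) rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_connections(rows):
    """Map each pair of artists to the set of entities they both connect to."""
    artists_by_entity = defaultdict(set)
    for artist, entity in rows:
        artists_by_entity[entity].add(artist)

    links = defaultdict(set)
    for entity, artists in artists_by_entity.items():
        # Every pair of artists sharing this entity gets an edge.
        for a, b in combinations(sorted(artists), 2):
            links[(a, b)].add(entity)
    return dict(links)


# Hypothetical rows: (artist, connected person).
rows = [
    ("Artist A", "Person X"),
    ("Artist B", "Person X"),
    ("Artist C", "Person Y"),
    ("Artist A", "Person Y"),
]
edges = shared_connections(rows)
print(edges)
```

Each key in `edges` corresponds to one link in the graph, and the attached set records which shared persons (or places) justify that link.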


📂 File Structure for the scraping tool

├── main.py                      # Entry point
├── scrapper_v2.py               # Web scraper and Data processing
├── finalCrawling.py             # Web crawling
├── finalMapping_v2.py           # Nationality and country detection and mapping
├── graphs.py                    # Graphs generator
├── finalWordCloud.py            # Word cloud generator
├── requirements.txt             # Dependencies
├── README.md                    # Project documentation
├── Amiri-Regular.ttf            # Font for word cloud
├── DejaVuSans.ttf               # Font for word cloud
├── countries_and_demonyms.csv   # Important data for finalMapping_v2.py 
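
To show how a demonym table can support nationality-to-country mapping, here is a minimal sketch. The column names `country` and `demonym`, the sample rows, and both helper functions are assumptions; the real layout of countries_and_demonyms.csv and the logic in finalMapping_v2.py are not documented in this README.

```python
import csv
import io

# Hypothetical excerpt; the real countries_and_demonyms.csv may use other columns.
SAMPLE = "country,demonym\nEgypt,Egyptian\nFrance,French\n"


def load_demonyms(handle):
    """Build a lowercase demonym -> country lookup from CSV rows."""
    return {row["demonym"].lower(): row["country"] for row in csv.DictReader(handle)}


def to_country(term, lookup):
    """Return the country for a demonym, or the cleaned term if it is unknown."""
    cleaned = term.strip()
    return lookup.get(cleaned.lower(), cleaned)


lookup = load_demonyms(io.StringIO(SAMPLE))
print(to_country("Egyptian", lookup))
print(to_country("Cairo", lookup))
```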

📂 File Structure for the interactive visualization tool

├── index.html                  # Main HTML file for visualization
├── script.js                   # JavaScript file containing D3.js logic 

🤝 Contributing

Feel free to submit issues or pull requests for improvements! We welcome contributions to enhance this project.


📢 Declaration

This project was developed by interns and utilizes open-source libraries and existing tools. We acknowledge and appreciate the efforts of the open-source community in making these resources available.

Contributors:

Leen Koree and AlDanah AlAnazi


🔗 Happy Coding! 🚀
