Description
This project aims to develop a web scraper using Python, Scrapy, Selenium, and Beautiful Soup to gather news articles related to nearshoring and sustainable economy initiatives from various online sources. The scraped data will be stored in a structured format for further analysis and visualization.
Websites to Scrape
- Forbes: https://www.forbes.com.mx/
- Proceso: https://www.proceso.com.mx/
- Expansión: https://expansion.mx/
- Reforma: https://www.reforma.com/
- El Economista: https://www.eleconomista.com.mx/
- Bloomberg: https://www.bloomberg.com/
- El Financiero: https://www.elfinanciero.com.mx/
- Thomson Reuters: https://www.thomsonreutersmexico.com/es-mx
- KPMG: https://kpmg.com/mx/es/home.html
- Deloitte: https://www2.deloitte.com/mx/es.html
- IMCO: https://imco.org.mx/
- Concanaco: https://www.concanaco.com.mx/
- Secretaría de Economía: https://www.gob.mx/se
- Banco de México: https://www.banxico.org.mx/
Relevant Terms
- Recolocación de la inversión en México (relocation of investment in Mexico)
- Relocalización de las inversiones hacia México (relocation of investments toward Mexico)
- Inversión extranjera directa en México (foreign direct investment in Mexico)
- Nearshoring
- Nearshoring en México (nearshoring in Mexico)
- Recolocación de las cadenas de suministro (relocation of supply chains)
- Inversión extranjera (foreign investment)
Technologies
- Python: The primary programming language for the project.
- Scrapy: A powerful web scraping framework for building efficient and scalable scrapers.
- Selenium: A browser automation tool for handling dynamic web pages and JavaScript-rendered content.
- Beautiful Soup: A Python library for parsing and extracting data from HTML and XML documents.
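As a quick illustration of how Beautiful Soup fits into the pipeline, the sketch below parses a small inline HTML fragment the way a spider might parse a downloaded article page. The markup, class names, and headline are made up for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical article markup, standing in for a page fetched by a spider.
html = """
<html><body>
  <h1 class="headline">Nearshoring impulsa la inversión en México</h1>
  <div class="article-body"><p>El nearshoring atrae inversión extranjera directa.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
headline = soup.find("h1", class_="headline").get_text(strip=True)
body = soup.find("div", class_="article-body").get_text(strip=True)
print(headline)  # Nearshoring impulsa la inversión en México
```

Each target site uses different markup, so the selectors above would be adapted per source.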
Project Setup
- Clone the repository:
git clone https://github.com/SPQR14/discovery_web_scrapper.git
- Install dependencies:
pip install -r requirements.txt
Scraping Process
- Define target websites: Identify the list of websites from which news articles will be scraped.
- Create Scrapy spiders: Develop a Scrapy spider for each target website, using Selenium to render dynamic, JavaScript-heavy pages and Beautiful Soup to extract relevant news articles and their content.
- Data storage: Implement a data storage mechanism to save the scraped data in a structured format, such as CSV or a database.
Running the Scraper
- Execute Scrapy spiders: Run the Scrapy spiders using the `scrapy crawl` command to initiate the scraping process.
- Monitor and maintain: Regularly monitor the scraper's performance and update the spiders as needed to adapt to changes in the target websites.
Data Analysis and Visualization
- Analyze scraped data: Perform data analysis on the scraped news articles to identify trends, patterns, and insights related to nearshoring and sustainable economy initiatives.
- Create visualizations: Generate visualizations, such as charts and graphs, to present the findings from the data analysis in a clear and concise manner.
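As a sketch of the analysis step, the snippet below counts keyword mentions across a handful of made-up headlines; a real run would load the CSV produced by the spiders (e.g. with `pd.read_csv("articles.csv")`):

```python
import pandas as pd

# Stand-in for the scraped dataset; titles and sources are invented for the example.
df = pd.DataFrame({
    "title": [
        "Nearshoring impulsa la inversión extranjera directa",
        "Cadenas de suministro se relocalizan hacia México",
        "Nearshoring en México alcanza cifras récord",
    ],
    "source": ["eleconomista", "expansion", "forbes"],
})

keywords = ["nearshoring", "inversión extranjera", "cadenas de suministro"]

# Count how many scraped titles mention each keyword (case-insensitive).
counts = {kw: int(df["title"].str.lower().str.contains(kw).sum()) for kw in keywords}
print(counts)  # {'nearshoring': 2, 'inversión extranjera': 1, 'cadenas de suministro': 1}
```

Plotting `counts` as a bar chart (for instance with matplotlib) would then cover the visualization step.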
Project Contributions
Contributions to this project are welcome, including:
- Improving scraper efficiency: Optimizing the scraping process to enhance speed and resource utilization.
- Expanding data sources: Adding new target websites to broaden the scope of news coverage.
- Enhancing data analysis: Developing more sophisticated data analysis techniques to extract deeper insights.
- Creating interactive visualizations: Designing interactive visualizations to facilitate data exploration and understanding.
Future Directions
- Machine learning integration: Incorporate machine learning techniques to classify news articles, identify sentiment, and summarize key points.
- Real-time data streaming: Implement real-time data streaming to capture and process news articles as they are published.
- Data sharing and collaboration: Establish a platform for sharing and collaborating on scraped data and analysis findings with the research community.
This project provides a valuable tool for gathering and analyzing news information related to nearshoring and sustainable economy initiatives. By combining web scraping techniques with data analysis and visualization, the project can contribute to a better understanding of these critical topics.