This repository provides two Python-based scrapers to extract data from TechCrunch articles. It includes:
- A listing scraper for fetching articles from multiple pages.
- A detail scraper to extract full content from individual article URLs.
📖 Read the full tutorial: How to Scrape TechCrunch with Python
- requests: for HTTP requests
- BeautifulSoup: for HTML parsing
- pandas: for saving data to CSV
Install required dependencies:
pip install requests beautifulsoup4 pandas
Scrapes article listings from the TechCrunch homepage or paginated archive.
Extracts:
- Title
- Link
- Author
- Publication Date
- Summary
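As a rough illustration of how these five fields might be pulled from a listing page, here is a minimal sketch using BeautifulSoup. The sample HTML, the CSS selectors (div.post-card, .author, .excerpt) and the function name parse_listing are all assumptions made for this example; the real TechCrunch markup differs and changes over time, so check the selectors against the live page and the script in this repository.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: every class name and selector here is a placeholder,
# not the structure TechCrunch actually serves.
SAMPLE_HTML = """
<div class="post-card">
  <h2><a href="https://techcrunch.com/2024/08/10/example-article/">AI startup gets acquired</a></h2>
  <span class="author">Jane Doe</span>
  <time datetime="2024-08-10">August 10, 2024</time>
  <p class="excerpt">This startup is changing the game...</p>
</div>
"""


def _text(node):
    """Return the stripped text of a tag, or an empty string if it is missing."""
    return node.get_text(strip=True) if node else ""


def parse_listing(html):
    """Return one dict per article card found in a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.post-card"):
        link = card.select_one("h2 a")
        time_tag = card.select_one("time")
        rows.append({
            "Title": _text(link),
            "Link": link.get("href", "") if link else "",
            "Author": _text(card.select_one(".author")),
            # Prefer the machine-readable datetime attribute over display text.
            "Publication Date": time_tag.get("datetime", "") if time_tag else "",
            "Summary": _text(card.select_one(".excerpt")),
        })
    return rows


rows = parse_listing(SAMPLE_HTML)
```

Missing elements fall back to empty strings, so one malformed card does not abort the whole page.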
Edit the number of pages to scrape (num_pages_to_scrape) and run:
python techcrunch_listing_scraper.py
Saves article listing data to techcrunch_listing.csv.
Title,Link,Author,Publication Date,Summary
"AI startup gets acquired","https://techcrunch.com/2024/08/10/example-article/","Jane Doe","2024-08-10","This startup is changing the game..."
...
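For reference, a file with this column layout could be produced with pandas roughly as follows. This is a sketch, not necessarily how the script in this repository does it: the function name listing_to_csv and the example row are assumptions. Note that pandas quotes only the fields that need it by default, so its output may differ cosmetically from the fully quoted sample above.

```python
import pandas as pd

# Column order taken from the sample output above.
COLUMNS = ["Title", "Link", "Author", "Publication Date", "Summary"]


def listing_to_csv(rows, path=None):
    """Write listing rows to CSV; return the CSV text when path is None."""
    df = pd.DataFrame(rows, columns=COLUMNS)
    # index=False keeps the DataFrame's row numbers out of the file.
    return df.to_csv(path, index=False)


example_rows = [{
    "Title": "AI startup gets acquired",
    "Link": "https://techcrunch.com/2024/08/10/example-article/",
    "Author": "Jane Doe",
    "Publication Date": "2024-08-10",
    "Summary": "This startup is changing the game...",
}]

csv_text = listing_to_csv(example_rows)
```

Passing a path such as "techcrunch_listing.csv" writes the file instead of returning a string.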
Scrapes individual article URLs and extracts:
- Title
- Author
- Publication Date
- Full Article Content
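The extraction step for a single article might look something like the sketch below. The sample HTML, the selectors (h1, .author, div.entry-content p) and the name parse_article are assumptions chosen for illustration; the actual page structure should be inspected before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup used only to demonstrate the parsing logic.
SAMPLE_ARTICLE = """
<article>
  <h1>AI startup gets acquired</h1>
  <a class="author" href="#">Jane Doe</a>
  <time datetime="2024-08-10">August 10, 2024</time>
  <div class="entry-content">
    <p>TechCrunch reports the acquisition of...</p>
    <p>Second paragraph.</p>
  </div>
</article>
"""


def parse_article(html):
    """Return title, author, date and full text for one article page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1")
    author = soup.select_one(".author")
    time_tag = soup.select_one("time")
    # Join all body paragraphs into a single text field, one per line.
    paragraphs = [p.get_text(strip=True) for p in soup.select("div.entry-content p")]
    return {
        "Title": title.get_text(strip=True) if title else "",
        "Author": author.get_text(strip=True) if author else "",
        "Publication Date": time_tag.get("datetime", "") if time_tag else "",
        "Content": "\n".join(paragraphs),
    }


article = parse_article(SAMPLE_ARTICLE)
```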
Update the article_urls list with your desired TechCrunch article links and run:
python techcrunch_article_scraper.py
Saves article content to techcrunch_articles.csv.
Title,Author,Publication Date,Content
"AI startup gets acquired","Jane Doe","2024-08-10","TechCrunch reports the acquisition of..."
...
- Add CLI support for inputting URLs.
- Add option to save data in JSON.
- Add retry logic for failed requests.
- Add proxy support for stealth scraping.
- Track breaking news in the tech world.
- Monitor coverage of startups or competitors.
- Build datasets for NLP and content analysis.
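Three of the improvements listed above (CLI input for URLs, JSON output, and retry logic) could be sketched with the standard library alone, as below. None of this exists in the repository yet: the flag names (--format, --retries) and the helpers build_parser, fetch_with_retries and to_json are hypothetical, and the fetch argument stands in for whatever function performs the HTTP request.

```python
import argparse
import json
import time


def build_parser():
    """Command-line interface: article URLs plus output options."""
    parser = argparse.ArgumentParser(description="Scrape TechCrunch article URLs.")
    parser.add_argument("urls", nargs="+", help="article URLs to scrape")
    parser.add_argument("--format", choices=["csv", "json"], default="csv",
                        help="output format")
    parser.add_argument("--retries", type=int, default=3,
                        help="attempts per URL before giving up")
    return parser


def fetch_with_retries(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Wait backoff, then 2 * backoff, then 4 * backoff, and so on.
            sleep(backoff * 2 ** attempt)


def to_json(rows):
    """Serialize scraped rows as a JSON array, keeping non-ASCII text readable."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


# Parsing an explicit argument list here; a real script would call parse_args().
args = build_parser().parse_args(
    ["--format", "json", "https://techcrunch.com/2024/08/10/example-article/"]
)
```

In a real script, fetch could be a small wrapper around requests.get that raises on HTTP errors, and the sleep parameter can stay at its default; it is injectable only so the backoff can be tested without waiting.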