v1.0.6 (Media Collection)

jaypyles released this 10 May 20:15
· 72 commits to master since this release
8cd3059

✨ Add Media Collection to Scraping Pipeline

Summary

This PR introduces the collect_media function, which detects media assets on a web page and downloads them automatically during a Selenium-controlled browser session.

🔧 Features

Supported Media Types:

  • Images (<img>)
  • Videos (<video>)
  • Audio files (<audio>)
  • PDFs (<a href="*.pdf">)
  • Documents (.doc, .docx, .txt, .rtf)
  • Presentations (.ppt, .pptx)
  • Spreadsheets (.xls, .xlsx, .csv)
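One way to picture the list above is as a table from media type to CSS selector. The sketch below is an assumption about how such a table could look, not the code shipped in this release: the names MEDIA_SELECTORS, LINK_EXTENSIONS, and link_selector are hypothetical, and the selectors for images, video, and audio are guesses based on the tags listed.

```python
# Hypothetical extension groups, taken from the supported types listed above.
LINK_EXTENSIONS = {
    "pdfs": ["pdf"],
    "documents": ["doc", "docx", "txt", "rtf"],
    "presentations": ["ppt", "pptx"],
    "spreadsheets": ["xls", "xlsx", "csv"],
}


def link_selector(extensions):
    """Build a CSS selector matching <a> tags whose href ends with any extension."""
    return ", ".join(f'a[href$=".{ext}"]' for ext in extensions)


# Illustrative mapping only; the real collect_media may use different selectors.
MEDIA_SELECTORS = {
    "images": "img[src]",
    "videos": "video[src], video source[src]",
    "audio": "audio[src], audio source[src]",
    **{name: link_selector(exts) for name, exts in LINK_EXTENSIONS.items()},
}
```

With Selenium, each selector could then be passed to driver.find_elements(By.CSS_SELECTOR, selector). Note that a suffix match such as a[href$=".pdf"] is case-sensitive and misses links that carry a query string after the extension.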

Functionality:

  • Uses CSS selectors to find elements containing media links.
  • Downloads each valid media file (HTTP/HTTPS only).
  • Saves all assets to a structured media/ directory, grouped by media type.
  • Writes a download_summary.txt with the original URLs and their local file paths.
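The storage steps above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the release's implementation: the helper names (is_http_url, save_asset, write_summary) and the "url -> path" summary format are hypothetical, and the network fetch is left out so the example stays self-contained.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Accept only HTTP/HTTPS URLs, as described above."""
    return urlparse(url).scheme in ("http", "https")


def save_asset(base_dir: Path, media_type: str, filename: str, content: bytes) -> Path:
    """Write content to <base_dir>/<media_type>/<filename> and return the path."""
    target_dir = base_dir / media_type
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / filename
    path.write_bytes(content)
    return path


def write_summary(base_dir: Path, entries: list[tuple[str, Path]]) -> Path:
    """Record each original URL next to its local path (format is an assumption)."""
    summary = base_dir / "download_summary.txt"
    lines = [f"{url} -> {path}" for url, path in entries]
    summary.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return summary
```

A caller would filter URLs with is_http_url, fetch the bytes by whatever HTTP client it prefers, and pass Path("media") as base_dir to get the media/ layout grouped by type.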

Error Handling:

  • Skips failed downloads and logs the error.
  • Generates fallback filenames when none are detected in the URL.
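The two error-handling behaviours can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the "<media_type>_<index>" fallback pattern, and the injected fetch callable are hypothetical and may differ from what collect_media does.

```python
import logging
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

logger = logging.getLogger(__name__)


def filename_from_url(url: str, media_type: str, index: int) -> str:
    """Use the last path segment of the URL, or a generated name if it is empty."""
    name = PurePosixPath(unquote(urlparse(url).path)).name
    return name or f"{media_type}_{index}"  # hypothetical fallback pattern


def download_all(urls, fetch):
    """Fetch every URL; log and skip any that raise instead of aborting the run."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:
            logger.error("Failed to download %s: %s", url, exc)
    return results
```

Passing fetch in as a parameter keeps the skip-and-log loop independent of any particular HTTP client, which also makes it easy to exercise with a stub.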