✨ Add Media Collection to Scraping Pipeline

Summary

This PR introduces the collect_media function, which automatically detects and downloads media assets from a web page loaded in a Selenium-controlled browser session.

🔧 Features

Supported Media Types:

Images (<img>)
Videos (<video>)
Audio files (<audio>)
PDFs (<a href="*.pdf">)
Documents (.doc, .docx, .txt, .rtf)
Presentations (.ppt, .pptx)
Spreadsheets (.xls, .xlsx, .csv)

Functionality:

Uses CSS selectors to find elements containing media links.
Downloads each valid media file (HTTP/HTTPS only).
Saves all assets to a structured media/ directory, grouped by media type.
Writes a download_summary.txt with the original URLs and their local file paths.
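
The steps above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the selector strings, the group names, the helper names (link_selector, is_downloadable, classify_link, target_path, format_summary) and the summary line format are all assumptions made for the example.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical selectors for tag-based media; the PR's real selectors may differ.
TAG_SELECTORS = {
    "images": "img[src]",
    "videos": "video[src], video source[src]",
    "audio": "audio[src], audio source[src]",
}

# Link-based media, grouped by the file extensions listed under Supported Media Types.
EXTENSION_GROUPS = {
    "pdfs": {".pdf"},
    "documents": {".doc", ".docx", ".txt", ".rtf"},
    "presentations": {".ppt", ".pptx"},
    "spreadsheets": {".xls", ".xlsx", ".csv"},
}


def link_selector(extensions):
    """Build a CSS selector matching <a> elements whose href ends in one of the extensions."""
    return ", ".join(f'a[href$="{ext}"]' for ext in sorted(extensions))


def is_downloadable(url):
    """Accept only HTTP/HTTPS URLs (rejects data:, blob:, javascript:, and so on)."""
    return urlparse(url).scheme in ("http", "https")


def classify_link(url):
    """Return the media group for a link based on its path extension, or None."""
    suffix = Path(urlparse(url).path).suffix.lower()
    for group, extensions in EXTENSION_GROUPS.items():
        if suffix in extensions:
            return group
    return None


def target_path(media_root, group, filename):
    """Place each file under <media_root>/<group>/, for example media/images/logo.png."""
    return Path(media_root) / group / filename


def format_summary(pairs):
    """Render (url, local_path) pairs as text lines; the 'url -> path' layout is assumed."""
    return "\n".join(f"{url} -> {path}" for url, path in pairs)
```

With a live Selenium session, the selectors would typically be passed to driver.find_elements(By.CSS_SELECTOR, selector) and the src or href attribute read from each element; that step is left out here so the sketch runs without a browser.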

Error Handling:

Skips failed downloads and logs the error.
Generates fallback filenames when none are detected in the URL.
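
One way the error handling described above might look is sketched below. The function names (filename_for, download_all), the "<group>_<index>" fallback naming scheme and the injected fetch callable are hypothetical and chosen only for illustration; the PR's actual implementation may differ.

```python
import logging
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


def filename_for(url, group, index):
    """Use the last path segment of the URL, or a generated name if it is empty."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    if name:
        return name
    # Assumed fallback scheme: "<group>_<index>", e.g. "images_3".
    return f"{group}_{index}"


def download_all(urls, fetch):
    """Call fetch(url) for each URL; log and skip failures instead of aborting.

    fetch is any callable that downloads one URL and returns its local path.
    Returns (saved, failed): a list of (url, local_path) pairs and a list of URLs.
    """
    saved, failed = [], []
    for url in urls:
        try:
            saved.append((url, fetch(url)))
        except Exception as exc:  # one bad asset should not stop the whole run
            logger.error("Failed to download %s: %s", url, exc)
            failed.append(url)
    return saved, failed
```

Passing the download step in as a callable keeps the skip-and-log loop independent of the HTTP client, so it can be exercised with a stub in tests.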

v1.0.6 (Media Collection)
