v1.0.6 (Media Collection)
✨ Add Media Collection to Scraping Pipeline
Summary
This PR introduces the collect_media function, which enhances scraping capabilities by automatically detecting and downloading various types of media assets from a web page using a Selenium-controlled browser session.
🔧 Features
Supported Media Types:
- Images (<img>)
- Videos (<video>)
- Audio files (<audio>)
- PDFs (<a href="*.pdf">)
- Documents (.doc, .docx, .txt, .rtf)
- Presentations (.ppt, .pptx)
- Spreadsheets (.xls, .xlsx, .csv)
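
To illustrate how the media types listed above might be told apart, here is a minimal sketch. The names EXTENSIONS_BY_TYPE, SELECTORS_BY_TYPE, and classify_link are hypothetical and are not taken from the PR; the actual selectors and grouping used by collect_media may differ.

```python
import os
from urllib.parse import urlparse

# Hypothetical mapping from media type to link extensions (assumed, not from the PR).
EXTENSIONS_BY_TYPE = {
    "pdfs": {".pdf"},
    "documents": {".doc", ".docx", ".txt", ".rtf"},
    "presentations": {".ppt", ".pptx"},
    "spreadsheets": {".xls", ".xlsx", ".csv"},
}

# Hypothetical CSS selectors for tag-based media (assumed, not from the PR).
SELECTORS_BY_TYPE = {
    "images": "img[src]",
    "videos": "video[src], video source[src]",
    "audio": "audio[src], audio source[src]",
}


def classify_link(url):
    """Return the media type for a link based on its path extension, or None."""
    # Use only the path so query strings such as ?download=1 do not hide the extension.
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lower()
    for media_type, extensions in EXTENSIONS_BY_TYPE.items():
        if extension in extensions:
            return media_type
    return None
```

Under these assumptions, tag-based media would be found with the selectors, while anchor links would be classified by extension, so a link to report.PDF?x=1 would still be grouped with PDFs.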
Functionality:
- Uses CSS selectors to find elements containing media links.
- Downloads each valid media file (HTTP/HTTPS only).
- Saves all assets to a structured media/ directory, grouped by media type.
- Writes a download_summary.txt with the original URLs and their local file paths.
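
The steps above could be sketched roughly as follows. This is not the PR's implementation: the helper names (is_downloadable, local_path_for, format_summary, write_summary) and the "URL -> path" line format of the summary are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_downloadable(url):
    """Accept only HTTP/HTTPS URLs; data:, blob:, ftp: and similar are skipped."""
    return urlparse(url).scheme in ("http", "https")


def local_path_for(media_root, media_type, filename):
    """Build the target path media_root/<media_type>/<filename> (no I/O)."""
    return Path(media_root) / media_type / filename


def format_summary(downloads):
    """Render (url, local_path) pairs, one per line. The format is an assumption."""
    return "".join(f"{url} -> {path}\n" for url, path in downloads)


def write_summary(media_root, downloads):
    """Write download_summary.txt under media_root and return its path."""
    root = Path(media_root)
    root.mkdir(parents=True, exist_ok=True)
    summary_path = root / "download_summary.txt"
    summary_path.write_text(format_summary(downloads), encoding="utf-8")
    return summary_path
```

Keeping path construction and summary formatting free of I/O makes them easy to check in isolation; only write_summary touches the filesystem.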
Error Handling:
- Skips failed downloads and logs the error.
- Generates fallback filenames when none are detected in the URL.
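
One way the error handling described above might look is sketched below. The function names, the media_<index>_<hash>.bin fallback pattern, and the injected fetch callable are all hypothetical; the PR does not specify how fallback names are generated or how downloads are performed.

```python
import hashlib
import logging
import os
from urllib.parse import unquote, urlparse

logger = logging.getLogger("collect_media")


def filename_from_url(url, index, default_ext=".bin"):
    """Use the last path segment as the filename, or build a fallback name."""
    name = os.path.basename(unquote(urlparse(url).path))
    if name:
        return name
    # Assumed fallback pattern: position in the list plus a short hash of the URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"media_{index}_{digest}{default_ext}"


def download_all(urls, fetch):
    """Fetch each URL with the caller-supplied fetch(url) -> bytes callable.

    A failing URL is logged and skipped so that one bad asset does not
    abort the whole run.
    """
    results = {}
    for index, url in enumerate(urls):
        try:
            results[filename_from_url(url, index)] = fetch(url)
        except Exception as exc:  # broad on purpose: any failure skips this URL
            logger.error("Skipping %s: %s", url, exc)
    return results
```

Passing fetch in as a parameter is a design choice for this sketch only; it lets the skip-and-log behavior be exercised without network access.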