Because clearly what the world needed was another way to scrape websites that don't want to be scraped.
Listen up, youngsters. Back in my day, websites were just HTML files. Clean, simple, readable. You could fetch them with a single HTTP request and be done with it. No JavaScript shenanigans, no reactive nonsense, no "single page applications" that are actually 15MB of obfuscated code.
But we live in darker times now.
So I made this. A REST API wrapper around Playwright that lets you extract what you need from the cesspool of modern web development without losing your sanity (what's left of it, anyway).
This service provides two primary endpoints:
- /screenshot: Takes pictures of websites. Revolutionary, I know.
- /page-dump: Gets you the ACTUAL content of a page, including HTML, JavaScript, styles, and (most importantly) the variables lurking in the DOM. Because apparently that's where people hide the good stuff now.
It runs headless browsers in a pool so you don't have to manage them yourself. It caches results so you don't hammer the same sites repeatedly like some kind of scraping amateur. It's containerized because that's what we do now, apparently.
You'll need Docker. If you don't have Docker, go get Docker. I'm not explaining how to install Python packages in 2025.
# Clone the repo
git clone https://github.com/heysamtexas/REST-headless-browser.git
# Build the Docker image
docker build -t rest-headless-browser .
# Run the container
docker run -p 8000:8000 rest-headless-browser
# Run with custom configuration
docker run -e BROWSER_MAX_BROWSERS=3 -e BROWSER_IDLE_TIMEOUT=180 -p 8000:8000 rest-headless-browser
Or use docker-compose if you're feeling fancy:
docker-compose up
Oh wait, there's no docker-compose.yml file yet. Add it to the TODO list. Fine, I'll do it myself eventually.
curl -X POST "http://localhost:8000/screenshot" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "width": 1280, "height": 800, "format": "png", "full_page": true}'
This returns a binary image. Save it, look at it, do whatever people do with screenshots these days.
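If you'd rather not squint at curl, here's the same call from Python. This is just a sketch: it assumes you have the requests library installed, and the output filename is whatever you feel like. The endpoint and parameters are exactly the ones shown above.

```python
# Grab a full-page PNG of example.com through the service and save it to disk.
import requests

resp = requests.post(
    "http://localhost:8000/screenshot",
    json={
        "url": "https://example.com",
        "width": 1280,
        "height": 800,
        "format": "png",
        "full_page": True,
    },
    timeout=60,
)
resp.raise_for_status()

with open("example.png", "wb") as f:
    f.write(resp.content)  # the endpoint returns raw image bytes
```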
curl -X GET "http://localhost:8000/page-dump?url=https://example.com"
This returns a JSON object containing:
- html: The full HTML of the page
- scripts: All scripts, including their content
- stylesheets: All CSS rules (that aren't blocked by CORS, because of course they are)
- variables: Global variables, localStorage, and sessionStorage
- images: All images on the page
- links: All links on the page
Use this to find the data you actually care about before writing a more targeted scraper.
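Same deal in Python, again assuming requests is installed. The keys match the list above; I'm not pinning down the exact shape of each field here, so the example just prints whatever comes back and lets you decide what's worth keeping.

```python
# Pull the page dump for example.com and poke around in the parts that
# usually matter: the global variables and the links.
import requests

resp = requests.get(
    "http://localhost:8000/page-dump",
    params={"url": "https://example.com"},
    timeout=60,
)
resp.raise_for_status()
dump = resp.json()

# The good stuff tends to hide in globals, localStorage, and sessionStorage.
for item in dump["variables"]:
    print(item)

# Links are handy for building a crawl list for your real scraper.
for link in dump["links"]:
    print(link)
```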
The service uses dynamic browser management: it starts with zero browsers and creates them on demand up to a configurable maximum. Idle browsers are automatically shut down after five minutes (the default idle timeout) to save memory.
- BROWSER_MAX_BROWSERS: Maximum concurrent browsers (default: 2, minimum: 1)
- BROWSER_IDLE_TIMEOUT: Browser idle timeout in seconds (default: 300)
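For the curious, this is roughly the shape of that on-demand pool. To be clear: this is NOT the code in this repo, just an illustrative sketch using Playwright's async API and the two environment variables above. The class and method names are made up.

```python
# Sketch of an on-demand browser pool: zero browsers at start, launch up to
# BROWSER_MAX_BROWSERS when requests arrive, reap anything idle too long.
import asyncio
import os
import time

from playwright.async_api import async_playwright

MAX_BROWSERS = int(os.getenv("BROWSER_MAX_BROWSERS", "2"))
IDLE_TIMEOUT = int(os.getenv("BROWSER_IDLE_TIMEOUT", "300"))


class BrowserPool:
    """Hands out Chromium instances on demand and reaps the idle ones."""

    def __init__(self):
        self._playwright = None
        self._idle = []       # (browser, last_used) pairs not currently in use
        self._alive = 0       # total browsers currently launched
        self._lock = asyncio.Lock()

    async def acquire(self):
        # Reuse an idle browser if there is one; otherwise launch a new one
        # as long as we're under the cap.
        async with self._lock:
            if self._playwright is None:
                self._playwright = await async_playwright().start()
            if self._idle:
                browser, _ = self._idle.pop()
                return browser
            if self._alive < MAX_BROWSERS:
                self._alive += 1
                return await self._playwright.chromium.launch(headless=True)
        raise RuntimeError("all browsers busy")  # real code would queue instead

    async def release(self, browser):
        # Put the browser back on the idle pile, stamped with the current time.
        async with self._lock:
            self._idle.append((browser, time.monotonic()))

    async def reap_idle(self):
        # Background task: close anything idle longer than the timeout.
        while True:
            await asyncio.sleep(30)
            async with self._lock:
                now = time.monotonic()
                still_warm = []
                for browser, last_used in self._idle:
                    if now - last_used > IDLE_TIMEOUT:
                        await browser.close()
                        self._alive -= 1
                    else:
                        still_warm.append((browser, last_used))
                self._idle = still_warm
```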
The page cache keeps results for an hour. Adjust the TTL in src/main.py if you need to.
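Conceptually the cache is nothing fancier than this (again, illustrative, not the actual src/main.py code): a timestamp per URL, and anything older than the TTL is treated as a miss.

```python
# Minimal TTL cache sketch: store (timestamp, payload) per URL and expire
# entries after an hour. Names are illustrative.
import time

PAGE_CACHE_TTL = 3600  # seconds; this is the knob to adjust
_cache: dict[str, tuple[float, dict]] = {}


def cache_get(url: str):
    entry = _cache.get(url)
    if entry and time.monotonic() - entry[0] < PAGE_CACHE_TTL:
        return entry[1]
    return None  # missing or stale


def cache_put(url: str, payload: dict) -> None:
    _cache[url] = (time.monotonic(), payload)
```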
The service is optimized for efficient memory usage:
- Idle state: ~160MB (0 browsers running)
- Active usage: ~250MB per browser instance
- Scaling: Creates browsers on-demand when requests arrive
- Cleanup: Automatically shuts down idle browsers after the idle timeout
This means your service will use minimal resources when idle and scale up only when needed.
Because I'm tired of websites that:
- Load content with JavaScript long after the DOM is "ready"
- Hide data in JavaScript variables instead of the HTML
- Use seventeen layers of divs to display what should be a simple table
- Implement "innovative" scroll behavior that breaks normal scraping
- Change their DOM structure every two weeks for "improved user experience"
I've been scraping websites since before half of today's "senior" developers were born. Trust me when I say this is the easiest way to deal with the abomination that is modern web development.
Because there's always more to do:
- Add a proper docker-compose.yml file for production setups
- Implement proper error handling that doesn't just dump tracebacks
- Add rate limiting to prevent self-DoS
- Create actual documentation for the API
- Add support for cookies and sessions
- Implement proxy rotation
- Clean up that experimental browser extension... or just remove it
- Add authentication so the whole internet doesn't use your instance
- Write some actual tests (ha!)
Found a bug? Fixed a bug? Added a feature? Submit a PR. I'll review it when I get around to it.
Want to complain? Open an issue. I'll read it while sipping coffee and reminiscing about the days when "web scraping" meant "wget" and a grep command.
MIT License. Do what you want, just don't blame me when it breaks.
This tool won't solve all your scraping problems. Nothing will. The web is a constantly evolving battlefield between those who want to share information and those who want to control how it's accessed.
But at least with this, you have a fighting chance.
Now get off my lawn.