GitHub - hanishi/pekko-playwright: Apache Pekko based web crawler that uses Playwright to crawl websites and extract text data and links for further processing.

🕷️ Crawler

Apache Pekko-based web crawler using Playwright to extract structured text and link data from dynamic websites.

This project combines the concurrency model of Apache Pekko with the browser automation power of Microsoft Playwright to build a scalable, actor-based scraping system.

It supports:

Headless browser automation
DOM content extraction
Click-based interaction (e.g. expand buttons)
Retry logic and error handling
Parallel scraping via actor supervision
Proxy support for IP rotation (see Proxy Configuration for details)
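
To make the "retry logic and error handling" item concrete, here is a minimal, dependency-free sketch of a bounded retry helper. It is not taken from this repository: `RetrySketch`, `retry`, `maxAttempts`, and `delayMillis` are hypothetical names, and the project itself most likely handles retries inside its Pekko actors rather than with a blocking helper like this one.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper for illustration only; not part of this repository.
object RetrySketch {
  // Runs `attempt` up to `maxAttempts` times, sleeping `delayMillis`
  // after each failed try. Returns the first Success, or the last
  // Failure once the attempts are used up. Try only catches non-fatal
  // exceptions, so fatal errors still propagate.
  @tailrec
  def retry[A](maxAttempts: Int, delayMillis: Long)(attempt: () => A): Try[A] =
    Try(attempt()) match {
      case success @ Success(_) => success
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(delayMillis)
        retry(maxAttempts - 1, delayMillis)(attempt)
      case failure => failure
    }
}
```

A page fetch could be wrapped as `RetrySketch.retry(3, 500L)(() => fetchText(url))`, where `fetchText` stands in for whatever function performs the Playwright navigation. In an actor-based design the same idea is usually expressed with a supervision strategy or a scheduled retry message, so that no thread blocks while waiting.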

🚧 Project Status: Work in Progress

This project is still under active development. You can try it out by running the tests provided in the tests/ folder to see the current scraping logic and data extraction behavior in action.

🎥 Scraping in Action

Screen.Recording.2025-07-18.at.0.17.08.mov

🧪 Bonus: IAB Taxonomy + OpenAI Integration

I also started experimenting with OpenAI to classify articles using the IAB Content Taxonomy. Right now it's a prototype: just a method in a test file (PublisherSiteSpec) acting as a quick main() substitute. But it works.

🤔 Why IAB Taxonomy?

For those not deep in AdTech:

The IAB Content Taxonomy is a standardized list of content categories like “Technology”, “Parenting”, “Investing”, etc. It’s widely used in digital advertising to describe the context of content.

Why it matters for publishers:

Higher CPMs – Better tagging → better targeting → better bids
Brand Safety – Advertisers avoid “unsafe” topics; classification helps content stay eligible
Programmatic Bidding – Taxonomy tags are passed in OpenRTB/header bidding auctions
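
As a rough illustration of the last point, the sketch below renders a classifier result as the kind of fragment an OpenRTB bid request can carry. It assumes the OpenRTB 2.6 `content` object, where `cat` lists category codes and `cattax` names the taxonomy in use (1 is understood here to mean IAB Content Category Taxonomy 1.0, in which `IAB19` is "Technology & Computing"). `OpenRtbCategorySketch`, `ContentCategories`, and `toContentJson` are hypothetical names, not code from this repository.

```scala
// Hypothetical sketch for illustration only; not part of this repository.
object OpenRtbCategorySketch {
  // Holder for a classifier's output: the taxonomy id and its category codes.
  final case class ContentCategories(cattax: Int, cat: List[String])

  // Builds the JSON by hand to stay dependency-free.
  // Assumes category codes contain no characters that need JSON escaping.
  def toContentJson(c: ContentCategories): String = {
    val cats = c.cat.map(code => "\"" + code + "\"").mkString("[", ",", "]")
    s"""{"content":{"cattax":${c.cattax},"cat":$cats}}"""
  }
}
```

Calling `toContentJson(ContentCategories(1, List("IAB19")))` yields `{"content":{"cattax":1,"cat":["IAB19"]}}`. A real integration would use a JSON library and whichever taxonomy version the exchange expects.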
