Apache Pekko based web crawler that uses Playwright to crawl websites and extract text data and links for further processing.

hanishi/pekko-playwright

🕷️ Crawler

Apache Pekko-based web crawler using Playwright to extract structured text and link data from dynamic websites.

This project combines the concurrency model of Apache Pekko with the browser automation power of Microsoft Playwright to build a scalable, actor-based scraping system.

It supports:

  • Headless browser automation
  • DOM content extraction
  • Click-based interaction (e.g. expand buttons)
  • Retry logic and error handling
  • Parallel scraping via actor supervision
  • Proxy support for IP rotation (see Proxy Configuration for details)
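
To make the retry item concrete, here is a minimal sketch of retrying a flaky step with exponential backoff. It is not this project's implementation: the class `RetrySketch`, the method `withRetry`, and its parameters are hypothetical names, and the "page load" is simulated with a lambda so the example runs on the JDK alone.

```java
import java.util.function.Supplier;

final class RetrySketch {
    /**
     * Runs the task up to maxAttempts times, doubling the wait after each failure.
     * Hypothetical helper for illustration; not taken from the project.
     */
    public static <T> T withRetry(Supplier<T> task, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMs); // back off before the next attempt
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting to retry", ie);
                    }
                    delayMs *= 2;
                }
            }
        }
        throw last; // every attempt failed: surface the most recent error
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated page load that fails twice before succeeding.
        String text = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated navigation timeout");
            }
            return "page text";
        }, 5, 10);
        System.out.println(text + " after " + calls[0] + " attempts");
    }
}
```

In an actor-based crawler the same policy could instead live in a supervision strategy, so that a failed page actor is restarted rather than retried inline. Which of the two this project uses is not stated here.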

🚧 Project Status: Work in Progress

This project is still under active development. You can try it out by running the tests provided in the tests/ folder to see the current scraping logic and data extraction behavior in action.

🎥 Scraping in Action

Screen.Recording.2025-07-18.at.0.17.08.mov

🧪 Bonus: IAB Taxonomy + OpenAI Integration

I have also started experimenting with OpenAI to classify articles using the IAB Content Taxonomy. Right now, it’s a prototype: a single method in a test file (PublisherSiteSpec) that acts as a quick main() substitute. But it works.

🤔 Why IAB Taxonomy?

For those not deep in AdTech:

The IAB Content Taxonomy is a standardized list of content categories like “Technology”, “Parenting”, “Investing”, etc. It’s widely used in digital advertising to describe the context of content.

Why it matters for publishers:

  1. Higher CPMs – Better tagging → better targeting → better bids
  2. Brand Safety – Advertisers avoid “unsafe” topics; accurate classification helps publishers stay eligible for their campaigns
  3. Programmatic Bidding – Taxonomy tags are passed in OpenRTB/header bidding auctions
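
As an illustration of the third point, the sketch below builds the part of an OpenRTB bid request that carries taxonomy tags. It assumes the OpenRTB 2.6 field names `site.content.cat` (category IDs) and `site.content.cattax` (which taxonomy the IDs come from), with `cattax` 1 taken to mean the Content Category Taxonomy 1.0 and `IAB19` its "Technology & Computing" entry. The class `TaxonomyTagSketch`, the method `siteFragment`, and the URL are hypothetical, and none of this code comes from the project.

```java
import java.util.List;
import java.util.stream.Collectors;

final class TaxonomyTagSketch {
    // Assumption: cattax = 1 selects IAB Content Category Taxonomy 1.0.
    static final int CATTAX_CONTENT_1_0 = 1;

    /**
     * Builds a minimal "site" fragment of a bid request as a JSON string.
     * Inputs are not escaped; a real implementation would use a JSON library.
     */
    public static String siteFragment(String pageUrl, int cattax, List<String> categoryIds) {
        String cats = categoryIds.stream()
                .map(id -> "\"" + id + "\"")
                .collect(Collectors.joining(","));
        return "{\"site\":{\"page\":\"" + pageUrl + "\","
                + "\"content\":{\"cattax\":" + cattax + ",\"cat\":[" + cats + "]}}}";
    }

    public static void main(String[] args) {
        // Tag a hypothetical article as "Technology & Computing" (IAB19).
        System.out.println(
                siteFragment("https://example.com/article", CATTAX_CONTENT_1_0, List.of("IAB19")));
    }
}
```

A classifier such as the OpenAI prototype described above would supply the category IDs; the ad stack then forwards them to bidders in this form.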
