
Scrapester


Documentation | Python SDK | JavaScript SDK | Playground

Turn any website into clean, LLM-ready data.

Overview

Scrapester is a powerful web scraping tool that converts website content into clean, markdown-formatted data perfect for LLM processing. With support for both single-page scraping and full website crawling, Scrapester makes it easy to gather web content in a structured, consistent format.

Features

  • 🔍 Smart Content Extraction: Automatically removes noise and extracts meaningful content
  • 📝 Markdown Output: Clean, structured content perfect for LLMs
  • 🕷️ Website Crawling: Scrape entire websites with configurable depth and limits
  • 🚀 Multiple SDKs: Official Python and JavaScript support
  • ⚡ High Performance: Built for speed and reliability
  • 🛡️ Error Handling: Robust error handling and rate-limit protection

Installation

Python

pip install scrapester

JavaScript/TypeScript

npm install scrapester
# or
yarn add scrapester

Quick Start

Python

from scrapester import ScrapesterApp

# Initialize the client
app = ScrapesterApp(api_key="your-api-key")

# Scrape a single page
result = app.scrape("https://example.com")
print(result.markdown)

# Crawl an entire website
results = app.crawl(
    "https://example.com",
    options={
        "max_pages": 10,
        "max_depth": 2
    }
)

JavaScript/TypeScript

import { ScrapesterApp } from 'scrapester';

// Initialize the client
const app = new ScrapesterApp('your-api-key');

// Scrape a single page
const result = await app.scrape('https://example.com');
console.log(result.markdown);

// Crawl an entire website
const results = await app.crawl('https://example.com', {
    maxPages: 10,
    maxDepth: 2
});

Response Format

Scrapester returns clean, structured data in the following format:

interface CrawlerResponse {
    url: string;          // The scraped URL
    markdown: string;     // Clean, markdown-formatted content
    metadata: {          // Page metadata
        title: string,
        description: string,
        // ... other meta tags
    };
    timestamp: string;   // ISO timestamp of when the page was scraped
}
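
For example, the fields of a CrawlerResponse can be read directly off the result of a scrape call (the values shown in the comments are illustrative):

const result = await app.scrape('https://example.com');

// Each CrawlerResponse field is available on the result
console.log(result.url);             // "https://example.com"
console.log(result.metadata.title);  // the page's <title> text
console.log(result.timestamp);       // e.g. "2025-01-01T12:00:00Z"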

API Reference

ScrapesterApp

Constructor

new ScrapesterApp(
    apiKey: string,
    baseUrl?: string,    // default: "http://localhost:8000"
    timeout?: number     // default: 600 seconds
)
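
For example, pointing the client at a self-hosted instance with a longer timeout (the URL below is a placeholder):

import { ScrapesterApp } from 'scrapester';

// baseUrl and timeout are optional; defaults are shown above
const app = new ScrapesterApp(
    'your-api-key',
    'https://scrapester.example.com',  // baseUrl (placeholder)
    900                                // timeout in seconds
);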

Methods

scrape(url: string)

Scrapes a single URL and returns clean, markdown-formatted content.

crawl(url: string, options?)

Crawls a website starting from the given URL. Options include:

  • maxPages: Maximum number of pages to crawl
  • maxDepth: Maximum crawl depth
  • includePatterns: URL patterns to include
  • excludePatterns: URL patterns to exclude
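
As a sketch, a crawl restricted to a site's blog section might look like the following. The exact pattern syntax is not specified here; glob-style patterns are assumed, so check the documentation for the supported format:

const results = await app.crawl('https://example.com', {
    maxPages: 50,
    maxDepth: 3,
    includePatterns: ['/blog/*'],      // only follow blog URLs (glob syntax assumed)
    excludePatterns: ['/blog/tag/*']   // skip tag index pages
});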

Error Handling

Scrapester provides detailed error information through the APIError class:

class APIError extends Error {
    statusCode?: number;
    response?: object;
}

Common error scenarios:

  • 429: Rate limit exceeded
  • 400: Invalid request
  • 401: Invalid API key
  • 500: Server error
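
A minimal sketch of handling these cases around a scrape call, assuming APIError is exported by the SDK alongside ScrapesterApp:

import { ScrapesterApp, APIError } from 'scrapester';

const app = new ScrapesterApp('your-api-key');

try {
    const result = await app.scrape('https://example.com');
    console.log(result.markdown);
} catch (error) {
    if (error instanceof APIError) {
        if (error.statusCode === 429) {
            // Rate limit exceeded: back off before retrying
            console.error('Rate limited, retry later:', error.response);
        } else {
            console.error(`API error ${error.statusCode}:`, error.response);
        }
    } else {
        throw error;  // network failures, bugs, etc.
    }
}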

Development

Running Tests

# Python
pytest tests/

# JavaScript
npm test

Building from Source

# Python
pip install -e ".[dev]"

# JavaScript
npm install
npm run build

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.
