WaterCrawl PHP Client


A PHP client for the WaterCrawl REST APIs. This package provides a simple and elegant way to interact with WaterCrawl's web scraping and crawling services.

Installation

You can install the package via Composer:

composer require watercrawl/php

Requirements

  • PHP 7.4 or higher
  • ext-mbstring
  • ext-json
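
For reference, here is a minimal composer.json sketch that pins these requirements; the watercrawl/php version constraint is illustrative, not prescribed, so adjust it to the release you target:

{
    "require": {
        "php": ">=7.4",
        "ext-mbstring": "*",
        "ext-json": "*",
        "watercrawl/php": "^1.0"
    }
}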

Usage

require 'vendor/autoload.php';

use WaterCrawl\APIClient;

// Initialize the client
$client = new APIClient('your-api-key');

// Scrape a single URL
$result = $client->scrapeUrl('https://example.com');

// Create a crawl request
$result = $client->createCrawlRequest(
    'https://example.com',
    ['allowed_domains' => ['example.com']],
    ['wait_time' => 1000]
);

// Monitor crawl progress
foreach ($client->monitorCrawlRequest($result['uuid']) as $update) {
    if ($update['type'] === 'result') {
        // Process the result
        print_r($update['data']);
    }
}

API Examples

Crawling Operations

List all crawl requests

// Get the first page of requests (default page size: 10)
$requests = $client->getCrawlRequestsList();

// Specify page number and size
$requests = $client->getCrawlRequestsList(2, 20);

Get a specific crawl request

$request = $client->getCrawlRequest('request-uuid');

Create a crawl request

// Simple request with just a URL
$request = $client->createCrawlRequest('https://example.com');

// Advanced request with options
$request = $client->createCrawlRequest(
    'https://example.com',
    [
        'max_depth' => 1, // maximum crawl depth
        'page_limit' => 1, // maximum number of pages to crawl
        'allowed_domains' => [], // domains the crawler may visit
        'exclude_paths' => [], // URL paths to skip
        'include_paths' => [] // URL paths to allow
    ],
    [
        'exclude_tags' => [], // HTML tags to strip from the page
        'include_tags' => [], // HTML tags to keep from the page
        'wait_time' => 1000, // wait time in milliseconds after page load
        'include_html' => false, // include raw HTML in the result
        'only_main_content' => true, // return only the main content of the page
        'include_links' => false, // include page links in the result
        'timeout' => 15000, // timeout in milliseconds
        'accept_cookies_selector' => null, // selector for a cookie-consent button
        'locale' => 'en-US', // browser locale
        'extra_headers' => [], // extra HTTP headers to send
        'actions' => [] // actions to perform on the page
    ],
    [] // plugin options
);

Stop a crawl request

$client->stopCrawlRequest('request-uuid');

Download a crawl request result

// Download the crawl request results
$results = $client->downloadCrawlRequest('request-uuid');
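
As a hedged sketch of processing the download, each entry is assumed to expose a 'url' key, mirroring the 'result' events in the monitoring example that follows; verify against the actual response shape:

// Assumption: each downloaded result carries a 'url' key.
foreach ($results as $result) {
    echo "Crawled: {$result['url']}\n";
}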

Monitor a crawl request

// Monitor with automatic result download (default)
foreach ($client->monitorCrawlRequest('request-uuid') as $event) {
    if ($event['type'] === 'state') {
        echo "Crawl state: {$event['data']['status']}\n";
    } elseif ($event['type'] === 'result') {
        echo "Received result for: {$event['data']['url']}\n";
    }
}

// Monitor without downloading results
foreach ($client->monitorCrawlRequest('request-uuid', false) as $event) {
    echo "Event type: {$event['type']}\n";
}

Get crawl request results

// Get the first page of results
$results = $client->getCrawlRequestResults('request-uuid');

// Specify page number and size
$results = $client->getCrawlRequestResults('request-uuid', 2, 20);

Quick URL scraping

// Synchronous scraping (default)
$result = $client->scrapeUrl('https://example.com');

// With page options
$result = $client->scrapeUrl(
    'https://example.com',
    [
        'wait_time' => 1000,
        'only_main_content' => true
    ]
);

// Asynchronous scraping
$request = $client->scrapeUrl('https://example.com', [], [], false);
// Later check for results with getCrawlRequest
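
For the asynchronous case, a minimal polling sketch follows; the 'status' key and its terminal values are assumptions about the response shape, not confirmed field names, so check the API reference before relying on them:

// Hedged sketch: poll until the request leaves a running state.
// 'status' and the values 'finished', 'failed', 'canceled' are assumptions.
do {
    sleep(2);
    $request = $client->getCrawlRequest($request['uuid']);
} while (!in_array($request['status'], ['finished', 'failed', 'canceled'], true));

$results = $client->downloadCrawlRequest($request['uuid']);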

Sitemap Operations

Download a sitemap

// Download using a crawl request object
$crawlRequest = $client->getCrawlRequest('request-uuid');
$sitemap = $client->downloadSitemap($crawlRequest);

// Or download using just the UUID
$sitemap = $client->downloadSitemap('request-uuid');

// Process sitemap entries
foreach ($sitemap as $entry) {
    echo "URL: {$entry['url']}, Title: {$entry['title']}\n";
}

Download sitemap as graph data

// Pass a crawl request UUID or a crawl request object
$graphData = $client->downloadSitemapGraph('request-uuid');
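
The shape of the graph payload is not documented here; assuming it decodes to a JSON-serializable array, one way to persist it for inspection is:

// Assumption: $graphData is a JSON-serializable array.
file_put_contents('sitemap-graph.json', json_encode($graphData, JSON_PRETTY_PRINT));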

Download sitemap as markdown

// Pass a crawl request UUID or a crawl request object
$markdown = $client->downloadSitemapMarkdown('request-uuid');

// Save to a file
file_put_contents('sitemap.md', $markdown);

Search Operations

Get search requests list

// Get the first page of search requests
$searchRequests = $client->getSearchRequestsList();

// Specify page number and size
$searchRequests = $client->getSearchRequestsList(2, 20);

Create a search request

// Simple search with synchronous results
$results = $client->createSearchRequest('php programming');

// Search with options and limited results
$results = $client->createSearchRequest(
    'php tutorial',
    [
        'language' => null, // language code e.g. "en" or "fr"
        'country' => null, // country code e.g. "us" or "fr"
        'time_range' => 'any', // time range e.g. "any", "hour", "day", "week", "month", "year"
        'search_type' => 'web', // search type e.g. "web"
        'depth' => 'basic' // depth e.g. "basic", "advanced", "ultimate"
    ],
    5, // limit the number of results
    true, // wait for results
    true // download results
);

// Asynchronous search
$searchRequest = $client->createSearchRequest(
    'machine learning',
    [], // search options
    5, // limit the number of results
    false, // Don't wait for results
    false // Don't download results
);

Monitor a search request

// Monitor a search request
foreach ($client->monitorSearchRequest('search-uuid') as $event) {
    if ($event['type'] === 'state') {
        echo "Search state: {$event['status']}\n";
    }
}

// Monitor without downloading results
foreach ($client->monitorSearchRequest('search-uuid', false) as $event) {
    echo "Event: " . json_encode($event) . "\n";
}

Get a search request

$searchRequest = $client->getSearchRequest('search-uuid');

Stop a search request

$client->stopSearchRequest('search-uuid');

Features

  • Simple and intuitive API
  • Real-time crawl monitoring
  • Configurable scraping options
  • Automatic response handling
  • Support for sitemaps and search operations
  • PHP 7.4+ compatibility
  • Proper UTF-8 support

Testing

composer test

Compatibility

  • WaterCrawl API >= 0.7.1

Changelog

Please see CHANGELOG.md for more information on what has changed recently.

Contributing

Please see CONTRIBUTING.md for details.

Security

If you discover any security-related issues, please email security@watercrawl.dev instead of using the issue tracker.

License

The MIT License (MIT). Please see License File for more information.
