
Scrape Flow

Scrape Flow is a comprehensive web scraping and automation platform for creating, scheduling, and managing sophisticated scraping workflows through an intuitive visual interface. It provides a powerful yet user-friendly way to extract data from websites, process it with AI, perform complex data transformations, and deliver results through multiple channels, including webhooks, email, and database storage.

Features

Core Capabilities

  • Visual Workflow Builder: Drag-and-drop interface for creating complex scraping workflows
  • 22 Task Types: Comprehensive set of tasks across 8 categories for all automation needs
  • AI-Powered Extraction: Intelligent data extraction using AI models
  • Real-time Execution: Live workflow execution with detailed logging and monitoring
  • Advanced Scheduling: Hybrid cron system supporting both simple and complex schedules
  • Credit Management: Transparent usage-based billing with Stripe integration
  • Secure Credential Storage: Encrypted storage for API keys and authentication data
  • Multi-format Output: JSON, CSV, and custom data transformation support

Advanced Features

  • Conditional Logic: Dynamic workflow paths based on data conditions
  • Loop Processing: Iterate over data sets with configurable loop controls
  • Database Integration: Direct database queries and data storage
  • Email Automation: Send rich HTML/text emails with attachments
  • File Downloads: Automated file downloading and processing
  • Screenshot Capture: Full-page and viewport screenshots with quality controls
  • Table Extraction: Structured data extraction from HTML tables
  • Data Transformation: Built-in data processing and formatting tools

Enterprise Features

  • Scalable Architecture: Built for high-volume data processing
  • Analytics Dashboard: Comprehensive execution statistics and credit usage tracking
  • User Management: Multi-user support with Clerk authentication
  • Rate Limiting: Built-in protection against abuse
  • Webhook Integration: Real-time data delivery to external systems
  • Error Handling: Robust error recovery and retry mechanisms

Task Types

Browser Automation (5 Tasks)

  • Launch Browser (5 credits): Initialize browser and navigate to websites
  • Navigate URL (2 credits): Navigate to specific URLs within workflows
  • Page to HTML (2 credits): Extract full HTML content from web pages
  • Wait for Element (1 credit): Wait for specific elements to appear/disappear
  • Take Screenshot (3 credits): Capture full-page or viewport screenshots

Element Interaction (4 Tasks)

  • Click Element (3 credits): Interact with clickable elements on pages
  • Fill Input (1 credit): Fill form inputs with dynamic data
  • Scroll to Element (1 credit): Scroll to specific elements on pages
  • Wait Delay (1 credit): Add controlled delays in workflow execution

Data Extraction (3 Tasks)

  • Extract Text from Element (2 credits): Extract text content using CSS selectors
  • Extract Table Data (3 credits): Extract structured data from HTML tables
  • Extract Data with AI (4 credits): AI-powered intelligent data extraction

Data Processing (4 Tasks)

  • Read Property from JSON (1 credit): Extract specific properties from JSON data
  • Add Property to JSON (1 credit): Add new properties to JSON objects
  • Data Transform (2 credits): Advanced data transformation and formatting
  • Conditional Logic (1 credit): Create dynamic workflow paths

Control Flow (1 Task)

  • Loop (2 credits): Iterate over data collections with customizable controls

Communication (2 Tasks)

  • Send Email (3 credits): Send HTML/text emails with attachment support
  • Deliver via Webhook (1 credit): Send data to external APIs and services

Storage & Database (2 Tasks)

  • Download File (3 credits): Download and process files from URLs
  • Database Query (2 credits): Execute SQL queries and store results

AI & Advanced (1 Task)

  • Extract Data with AI (4 credits): Leverage AI for complex data extraction scenarios (also listed under Data Extraction)

Technology Stack

Frontend

  • Framework: Next.js 14 (App Router)
  • UI Components: Shadcn UI (built on Radix UI)
  • Styling: Tailwind CSS
  • State Management: TanStack Query (React Query)
  • Forms: React Hook Form with Zod validation
  • Data Visualization: Recharts
  • Flow Editor: XYFlow (React Flow)

Backend

  • API: Next.js Server Actions and API Routes
  • Database ORM: Prisma
  • Authentication: Clerk
  • Payments: Stripe
  • Scheduling: Custom cron scheduler with hybrid approach
  • Web Scraping: Puppeteer/Chromium
  • AI Integration: OpenAI
  • Email Service: Resend

Database

  • Primary Database: PostgreSQL
  • Schema Management: Prisma Migrations
  • Query Builder: Custom query execution for Database Query tasks

Infrastructure

  • Hosting: Vercel
  • Cron Jobs: Vercel Cron + Browser-based hybrid system
  • File Storage: Built-in file download and processing
  • Encryption: AES-256-CBC for credential storage

Live Demo

Experience Scrape Flow in action at our Live Demo

Getting Started

Prerequisites

  • Node.js (v18 or higher)
  • PostgreSQL database
  • Clerk account (for authentication)
  • Stripe account (for payments)
  • OpenAI API key (for AI features)
  • Resend API key (for email functionality)

Installation

  1. Clone the repository:
git clone https://github.com/your-username/scrape-flow.git
cd scrape-flow
  2. Install dependencies:
npm install
  3. Set up environment variables: create a .env file in the root directory and add the required variables (see the Environment Variables section).
  4. Run database migrations:
npx prisma migrate dev
  5. Start the development server:
npm run dev

Architecture

Workflow Execution Engine

The application features a sophisticated workflow execution engine that handles:

  • Task Registry: Dynamic loading and validation of all task types
  • Execution Context: Isolated execution environments for each workflow run
  • Credit Tracking: Real-time credit consumption monitoring
  • Error Recovery: Automatic retry mechanisms and graceful error handling
  • Progress Tracking: Live execution progress updates
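
As a rough illustration, a registry entry pairs each task type with its metadata and credit cost. This sketch uses hypothetical names and a trimmed-down shape; the real definitions live in lib/workflow/task/ (see Contributing):

type TaskType = "LAUNCH_BROWSER" | "PAGE_TO_HTML" | "EXTRACT_TEXT_FROM_ELEMENT";

interface TaskDefinition {
  type: TaskType;
  label: string;
  credits: number;       // credits charged when the task runs
  isEntryPoint: boolean; // whether the task can start a workflow
}

const TaskRegistry: Record<TaskType, TaskDefinition> = {
  LAUNCH_BROWSER: { type: "LAUNCH_BROWSER", label: "Launch Browser", credits: 5, isEntryPoint: true },
  PAGE_TO_HTML: { type: "PAGE_TO_HTML", label: "Page to HTML", credits: 2, isEntryPoint: false },
  EXTRACT_TEXT_FROM_ELEMENT: { type: "EXTRACT_TEXT_FROM_ELEMENT", label: "Extract Text from Element", credits: 2, isEntryPoint: false },
};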

Data Flow

  1. Workflow Definition: Users create workflows using the visual editor
  2. Validation: Workflows are validated for completeness and credit requirements
  3. Execution: Tasks are executed sequentially with data passing between them
  4. Monitoring: Real-time execution monitoring with detailed logging
  5. Results: Data is delivered through configured output channels
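
A minimal sketch of this sequential flow, assuming hypothetical runTask and consumeCredits helpers (the real engine also validates workflows up front and streams progress updates):

type Task = { label: string; credits: number };

async function executeWorkflow(
  tasks: Task[],
  runTask: (task: Task, env: Map<string, string>) => Promise<boolean>,
  consumeCredits: (amount: number) => Promise<boolean>,
): Promise<void> {
  const environment = new Map<string, string>(); // data handed from task to task
  for (const task of tasks) {
    // charge credits before running; abort if the balance is insufficient
    if (!(await consumeCredits(task.credits))) {
      throw new Error(`Insufficient credits for ${task.label}`);
    }
    // a failed task stops the run; failures are surfaced to monitoring
    if (!(await runTask(task, environment))) {
      throw new Error(`Task failed: ${task.label}`);
    }
  }
}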

Workflow System

Visual Workflow Editor

  • Drag-and-Drop Interface: Intuitive task placement and connection
  • Real-time Validation: Immediate feedback on workflow completeness
  • Task Configuration: Detailed parameter configuration for each task
  • Preview Mode: Test workflows without consuming credits
  • Export/Import: Save and share workflow definitions

Execution Engine

  • Sequential Processing: Tasks execute in defined order
  • Data Persistence: Workflow state maintained throughout execution
  • Credit Management: Pre-execution credit validation and real-time tracking
  • Error Handling: Graceful failure recovery with detailed error messages
  • Execution History: Complete audit trail of all workflow runs

Scheduling & Automation

Hybrid Scheduling System

To work around the coarse granularity of platform-level cron jobs, Scrape Flow implements a hybrid scheduling approach:

Vercel Cron Integration

  • Daily cron job configured in vercel.json
  • Serves as a backup execution trigger
  • Ensures workflows are checked at least once daily
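
The vercel.json entry for such a trigger would look roughly like this (the schedule shown is illustrative; the path matches the endpoint polled below):

{
  "crons": [
    { "path": "/api/workflows/cron", "schedule": "0 0 * * *" }
  ]
}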

Browser-Based Local Cron

  • Client-side component running in user browsers
  • Polls /api/workflows/cron endpoint at configurable intervals
  • Works in both development and production environments
  • Configurable polling frequency (default: 60 seconds)
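
A minimal sketch of such a poller as a client component, assuming the environment variables documented below (the real component may differ):

"use client";
import { useEffect } from "react";

const FREQUENCY = Number(process.env.NEXT_PUBLIC_LOCAL_CRON_FREQUENCY ?? 60_000);

export function LocalCron() {
  useEffect(() => {
    if (process.env.NEXT_PUBLIC_ENABLE_LOCAL_CRON !== "true") return;
    const id = setInterval(() => {
      // timestamp query param busts intermediate caches (see Resilience Features)
      fetch(`/api/workflows/cron?t=${Date.now()}`, { cache: "no-store" }).catch(() => {
        // swallow transient errors; the next tick retries
      });
    }, FREQUENCY);
    return () => clearInterval(id);
  }, []);
  return null; // renders nothing; mounted once in the app layout
}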

Schedule Types

Simple Intervals

  • Seconds: 30s (every 30 seconds)
  • Minutes: 5m (every 5 minutes)
  • Hours: 2h (every 2 hours)
  • Days: 1d (daily)

Cron Expressions

  • Standard Cron: 0 9 * * 1-5 (weekdays at 9am)
  • Complex Patterns: */15 * * * * (every 15 minutes)
  • Custom Schedules: Full cron expression support
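
Shorthand intervals can be normalized to milliseconds before scheduling; a sketch of a hypothetical helper:

function intervalToMs(interval: string): number {
  const match = /^(\d+)([smhd])$/.exec(interval.trim());
  if (!match) throw new Error(`Unsupported interval: ${interval}`);
  const unitMs = { s: 1_000, m: 60_000, h: 3_600_000, d: 86_400_000 };
  return Number(match[1]) * unitMs[match[2] as keyof typeof unitMs];
}

// intervalToMs("30s") === 30_000; intervalToMs("2h") === 7_200_000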

Resilience Features

  • Exponential Backoff: Automatic retry with increasing delays
  • Cache Prevention: Timestamp-based cache busting
  • Error Recovery: Continues operation during temporary failures
  • Resource Optimization: Minimal API calls when browser is inactive
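
For illustration, exponential backoff of this kind typically doubles the delay on each failed attempt; a generic sketch:

async function withBackoff<T>(fn: () => Promise<T>, retries = 5, baseMs = 1_000): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // give up after the final retry
      // delays grow as 1s, 2s, 4s, 8s, ...
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** attempt));
    }
  }
}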

Credit System & Billing

Credit-Based Usage

  • Transparent Pricing: Each task type has a defined credit cost
  • Pre-execution Validation: Workflows validate credit requirements before running
  • Real-time Tracking: Live credit consumption monitoring
  • Usage Analytics: Detailed breakdown of credit usage by task type

Billing Integration

  • Stripe Integration: Secure payment processing
  • Flexible Plans: Multiple credit packages available
  • Auto-refill: Optional automatic credit top-up
  • Transaction History: Complete billing history and receipts
  • Usage Forecasting: Predict credit needs based on workflow schedules

Credit Costs by Task Type

  • High Cost (5 credits): Launch Browser
  • Medium Cost (3-4 credits): AI Extraction, Screenshots, File Downloads, Email Sending
  • Standard Cost (2 credits): Data extraction, Database queries, Navigation
  • Low Cost (1 credit): Simple operations, waits, property manipulation
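
Pre-execution validation then reduces to summing these per-task costs over the workflow; a sketch with a worked example:

function totalCreditCost(tasks: { credits: number }[]): number {
  return tasks.reduce((sum, task) => sum + task.credits, 0);
}

// Launch Browser (5) + Page to HTML (2) + Extract Data with AI (4) = 11 credits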

Security & Credentials

Authentication & Authorization

  • Clerk Integration: Secure user authentication and session management
  • Protected Routes: Middleware-based route protection
  • API Security: Secure API endpoints with proper authentication
  • User Isolation: Complete data separation between users

Credential Management

  • AES-256-CBC Encryption: Industry-standard encryption for stored credentials
  • Secure Storage: Database-encrypted credential storage
  • Access Control: User-specific credential access
  • Audit Trail: Complete history of credential usage
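
A typical AES-256-CBC round trip in Node looks like the following. The iv:ciphertext payload format and helper names are assumptions, not the app's actual code:

import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const ALGORITHM = "aes-256-cbc";
// ENCRYPTION_SECRET must decode to 32 bytes for AES-256 (see Configuration Notes)
const key = Buffer.from(process.env.ENCRYPTION_SECRET!, "hex");

export function encrypt(plain: string): string {
  const iv = randomBytes(16); // fresh IV per value, stored alongside the ciphertext
  const cipher = createCipheriv(ALGORITHM, key, iv);
  const data = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  return `${iv.toString("hex")}:${data.toString("hex")}`;
}

export function decrypt(payload: string): string {
  const [ivHex, dataHex] = payload.split(":");
  const decipher = createDecipheriv(ALGORITHM, key, Buffer.from(ivHex, "hex"));
  return Buffer.concat([decipher.update(Buffer.from(dataHex, "hex")), decipher.final()]).toString("utf8");
}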

Security Best Practices

  • Parameterized Queries: SQL injection prevention via Prisma
  • Environment Variables: Secure secret management
  • HTTPS Enforcement: Secure communication channels
  • Rate Limiting: Built-in protection against abuse

Analytics & Monitoring

Execution Analytics

  • Success/Failure Rates: Comprehensive execution statistics
  • Performance Metrics: Execution time and resource usage tracking
  • Credit Consumption: Detailed credit usage analytics
  • Trend Analysis: Historical data and usage patterns

Real-time Monitoring

  • Live Execution Tracking: Real-time workflow execution monitoring
  • Progress Indicators: Visual progress tracking for running workflows
  • Error Reporting: Immediate error notifications and detailed logs
  • Performance Alerts: Automatic alerts for unusual activity

Dashboard Features

  • Usage Overview: High-level statistics and trends
  • Workflow Performance: Individual workflow success rates
  • Credit Forecasting: Predictive analytics for credit usage
  • Export Capabilities: Data export for external analysis

Environment Variables

Required Variables

DATABASE_URL=postgresql://username:password@host:port/database
API_SECRET=your-secure-api-secret-for-cron-authentication
ENCRYPTION_SECRET=64-hex-character-key-for-credential-encryption
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_test_your-clerk-public-key
CLERK_SECRET_KEY=sk_test_your-clerk-secret-key
STRIPE_SECRET_KEY=sk_test_your-stripe-secret-key
STRIPE_WEBHOOK_SECRET=whsec_your-stripe-webhook-secret
OPENAI_API_KEY=sk-your-openai-api-key
RESEND_API_KEY=re_your-resend-api-key

Optional Variables

NEXT_PUBLIC_ENABLE_LOCAL_CRON=true
NEXT_PUBLIC_LOCAL_CRON_FREQUENCY=60000
NEXT_PUBLIC_APP_URL=https://your-domain.com

Configuration Notes

  • ENCRYPTION_SECRET: Must decode to a 32-byte AES-256 key (64 hex characters when hex-encoded)
  • API_SECRET: Used for secure communication between cron endpoints
  • Local Cron: Enable for development or enhanced scheduling reliability
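
Assuming a hex-encoded key, a suitable ENCRYPTION_SECRET can be generated with:

openssl rand -hex 32   # 32 random bytes, printed as 64 hex characters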

Screenshots

The repository includes screenshots of the following screens:

  • Login & Signup page
  • Home Page
  • Home Page with Chart
  • Workflow Page
  • Workflow Page with Actions
  • Workflow Creation Form
  • Workflow Scheduler Configuration
  • Workflow Editor
  • Workflow Editor Validation
  • Execution Monitor
  • All Workflows
  • Credential Page
  • Credential Creation Form
  • Billing Page
  • Billing Chart
  • Transaction History
  • Payment Invoice

Deployment

Vercel Deployment (Recommended)

  1. Repository Setup: Connect your GitHub repository to Vercel
  2. Environment Configuration: Add all required environment variables in Vercel dashboard
  3. Database Setup: Configure PostgreSQL database (Vercel Postgres recommended)
  4. Deployment: Deploy the application through Vercel interface
  5. Cron Configuration: Verify vercel.json cron configuration is active

Manual Deployment

  1. Build: npm run build
  2. Database: Run migrations in production environment
  3. Environment: Ensure all environment variables are properly set
  4. Start: npm start or use process manager like PM2

Post-Deployment Checklist

  • Verify database connections
  • Test authentication flow
  • Confirm Stripe webhook endpoints
  • Validate cron job execution
  • Test workflow creation and execution
  • Verify email functionality

Contributing

We welcome contributions to Scrape Flow! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature-name
  3. Make your changes with appropriate tests
  4. Ensure code follows existing style guidelines
  5. Submit a pull request with detailed description

Adding New Task Types

  1. Create task definition in lib/workflow/task/
  2. Add task to registry in lib/workflow/task/registry.tsx
  3. Implement executor in lib/workflow/executor/
  4. Add proper TypeScript types
  5. Include comprehensive tests
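
As a rough sketch, an executor following step 3 might look like this (the ExecutionEnvironment shape and executor signature here are hypothetical; consult the existing executors for the real contract):

// Hypothetical shapes for illustration only.
interface ExecutionEnvironment {
  getInput(name: string): string;
  setOutput(name: string, value: string): void;
  log: { info(msg: string): void; error(msg: string): void };
}

export async function WaitDelayExecutor(env: ExecutionEnvironment): Promise<boolean> {
  try {
    const ms = Number(env.getInput("Delay (ms)"));
    await new Promise((resolve) => setTimeout(resolve, ms));
    env.setOutput("Completed at", new Date().toISOString());
    return true; // signal success to the engine
  } catch (err) {
    env.log.error(String(err));
    return false; // signal failure; the engine handles recovery
  }
}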

Coding Standards

  • TypeScript: Strict mode enabled, proper type definitions required
  • Formatting: Prettier and ESLint configurations must be followed
  • Testing: Unit tests required for new features
  • Documentation: Update README and inline documentation

Bug Reports

Please include:

  • Detailed description of the issue
  • Steps to reproduce
  • Expected vs actual behavior
  • Environment details (Node.js version, browser, etc.)
  • Screenshots if applicable

Development Commands

# Database management
npx prisma migrate dev    # Run database migrations
npx prisma studio        # Open Prisma Studio
npx prisma generate      # Generate Prisma client

# Development
npm run dev             # Start development server
npm run build          # Build for production
npm run start          # Start production server

# Database queries (custom command)
npm run queries        # Run custom database queries

License: MIT
Maintainer: Harmik Lathiya
Support: Create an issue
