
Scrape Flow

Scrape Flow is a comprehensive web scraping and automation platform for creating, scheduling, and managing sophisticated scraping workflows through an intuitive visual interface. It provides a powerful yet user-friendly way to extract data from websites, process it with AI, perform complex data transformations, and deliver results through multiple channels, including webhooks, email, and database storage.

Features

Core Capabilities

  • Visual Workflow Builder: Drag-and-drop interface for creating complex scraping workflows
  • 22 Task Types: Comprehensive set of tasks across 8 categories for all automation needs
  • AI-Powered Extraction: Intelligent data extraction using AI models
  • Real-time Execution: Live workflow execution with detailed logging and monitoring
  • Advanced Scheduling: Hybrid cron system supporting both simple and complex schedules
  • Credit Management: Transparent usage-based billing with Stripe integration
  • Secure Credential Storage: Encrypted storage for API keys and authentication data
  • Multi-format Output: JSON, CSV, and custom data transformation support

Advanced Features

  • Conditional Logic: Dynamic workflow paths based on data conditions
  • Loop Processing: Iterate over data sets with configurable loop controls
  • Database Integration: Direct database queries and data storage
  • Email Automation: Send rich HTML/text emails with attachments
  • File Downloads: Automated file downloading and processing
  • Screenshot Capture: Full-page and viewport screenshots with quality controls
  • Table Extraction: Structured data extraction from HTML tables
  • Data Transformation: Built-in data processing and formatting tools

Enterprise Features

  • Scalable Architecture: Built for high-volume data processing
  • Analytics Dashboard: Comprehensive execution statistics and credit usage tracking
  • User Management: Multi-user support with Clerk authentication
  • Rate Limiting: Built-in protection against abuse
  • Webhook Integration: Real-time data delivery to external systems
  • Error Handling: Robust error recovery and retry mechanisms

Task Types

Browser Automation (5 Tasks)

  • Launch Browser (5 credits): Initialize browser and navigate to websites
  • Navigate URL (2 credits): Navigate to specific URLs within workflows
  • Page to HTML (2 credits): Extract full HTML content from web pages
  • Wait for Element (1 credit): Wait for specific elements to appear/disappear
  • Take Screenshot (3 credits): Capture full-page or viewport screenshots

Element Interaction (4 Tasks)

  • Click Element (3 credits): Interact with clickable elements on pages
  • Fill Input (1 credit): Fill form inputs with dynamic data
  • Scroll to Element (1 credit): Scroll to specific elements on pages
  • Wait Delay (1 credit): Add controlled delays in workflow execution

Data Extraction (3 Tasks)

  • Extract Text from Element (2 credits): Extract text content using CSS selectors
  • Extract Table Data (3 credits): Extract structured data from HTML tables
  • Extract Data with AI (4 credits): AI-powered intelligent data extraction

Data Processing (4 Tasks)

  • Read Property from JSON (1 credit): Extract specific properties from JSON data
  • Add Property to JSON (1 credit): Add new properties to JSON objects
  • Data Transform (2 credits): Advanced data transformation and formatting
  • Conditional Logic (1 credit): Create dynamic workflow paths

Control Flow (1 Task)

  • Loop (2 credits): Iterate over data collections with customizable controls

Communication (2 Tasks)

  • Send Email (3 credits): Send HTML/text emails with attachment support
  • Deliver via Webhook (1 credit): Send data to external APIs and services

Storage & Database (2 Tasks)

  • Download File (3 credits): Download and process files from URLs
  • Database Query (2 credits): Execute SQL queries and store results

AI & Advanced (1 Task)

  • Extract Data with AI (4 credits): Leverage AI for complex data extraction scenarios (also listed under Data Extraction)

Technology Stack

Frontend

  • Framework: Next.js 14 (App Router)
  • UI Components: Shadcn UI (built on Radix UI)
  • Styling: Tailwind CSS
  • State Management: TanStack Query (React Query)
  • Forms: React Hook Form with Zod validation
  • Data Visualization: Recharts
  • Flow Editor: XYFlow (React Flow)

Backend

  • API: Next.js Server Actions and API Routes
  • Database ORM: Prisma
  • Authentication: Clerk
  • Payments: Stripe
  • Scheduling: Custom cron scheduler with hybrid approach
  • Web Scraping: Puppeteer/Chromium
  • AI Integration: OpenAI
  • Email Service: Resend

Database

  • Primary Database: PostgreSQL
  • Schema Management: Prisma Migrations
  • Query Builder: Custom query execution for Database Query tasks

Infrastructure

  • Hosting: Vercel
  • Cron Jobs: Vercel Cron + Browser-based hybrid system
  • File Storage: Built-in file download and processing
  • Encryption: AES-256-CBC for credential storage

Live Demo

Experience Scrape Flow in action at our Live Demo

Getting Started

Prerequisites

  • Node.js (v18 or higher)
  • PostgreSQL database
  • Clerk account (for authentication)
  • Stripe account (for payments)
  • OpenAI API key (for AI features)
  • Resend API key (for email functionality)

Installation

  1. Clone the repository:
git clone https://github.com/your-username/scrape-flow.git
cd scrape-flow
  2. Install dependencies:
npm install
  3. Set up environment variables: create a .env file in the root directory and add the required variables (see the Environment Variables section).
  4. Run database migrations:
npx prisma migrate dev
  5. Start the development server:
npm run dev

Architecture

Workflow Execution Engine

The application features a sophisticated workflow execution engine that handles:

  • Task Registry: Dynamic loading and validation of all task types
  • Execution Context: Isolated execution environments for each workflow run
  • Credit Tracking: Real-time credit consumption monitoring
  • Error Recovery: Automatic retry mechanisms and graceful error handling
  • Progress Tracking: Live execution progress updates
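
As a rough illustration, a registry entry pairs each task type with its metadata and credit cost. This sketch uses hypothetical names and a trimmed-down shape; the real definitions live in lib/workflow/task/ (see Contributing):

type TaskType = "LAUNCH_BROWSER" | "PAGE_TO_HTML" | "EXTRACT_TEXT_FROM_ELEMENT";

interface TaskDefinition {
  type: TaskType;
  label: string;
  credits: number;       // credits charged when the task runs
  isEntryPoint: boolean; // whether the task can start a workflow
}

const TaskRegistry: Record<TaskType, TaskDefinition> = {
  LAUNCH_BROWSER: { type: "LAUNCH_BROWSER", label: "Launch Browser", credits: 5, isEntryPoint: true },
  PAGE_TO_HTML: { type: "PAGE_TO_HTML", label: "Page to HTML", credits: 2, isEntryPoint: false },
  EXTRACT_TEXT_FROM_ELEMENT: { type: "EXTRACT_TEXT_FROM_ELEMENT", label: "Extract Text from Element", credits: 2, isEntryPoint: false },
};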

Data Flow

  1. Workflow Definition: Users create workflows using the visual editor
  2. Validation: Workflows are validated for completeness and credit requirements
  3. Execution: Tasks are executed sequentially with data passing between them
  4. Monitoring: Real-time execution monitoring with detailed logging
  5. Results: Data is delivered through configured output channels
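
A minimal sketch of this sequential flow, assuming hypothetical runTask and consumeCredits helpers (the real engine also validates workflows up front and streams progress updates):

type Task = { label: string; credits: number };

async function executeWorkflow(
  tasks: Task[],
  runTask: (task: Task, env: Map<string, string>) => Promise<boolean>,
  consumeCredits: (amount: number) => Promise<boolean>,
): Promise<void> {
  const environment = new Map<string, string>(); // data handed from task to task
  for (const task of tasks) {
    // charge credits before running; abort if the balance is insufficient
    if (!(await consumeCredits(task.credits))) {
      throw new Error(`Insufficient credits for ${task.label}`);
    }
    // a failed task stops the run; failures are surfaced to monitoring
    if (!(await runTask(task, environment))) {
      throw new Error(`Task failed: ${task.label}`);
    }
  }
}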

Workflow System

Visual Workflow Editor

  • Drag-and-Drop Interface: Intuitive task placement and connection
  • Real-time Validation: Immediate feedback on workflow completeness
  • Task Configuration: Detailed parameter configuration for each task
  • Preview Mode: Test workflows without consuming credits
  • Export/Import: Save and share workflow definitions

Execution Engine

  • Sequential Processing: Tasks execute in defined order
  • Data Persistence: Workflow state maintained throughout execution
  • Credit Management: Pre-execution credit validation and real-time tracking
  • Error Handling: Graceful failure recovery with detailed error messages
  • Execution History: Complete audit trail of all workflow runs

Scheduling & Automation

Hybrid Scheduling System

To work around the coarse granularity of platform-level cron jobs, Scrape Flow implements a hybrid scheduling approach:

Vercel Cron Integration

  • Daily cron job configured in vercel.json
  • Serves as a backup execution trigger
  • Ensures workflows are checked at least once daily
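
The vercel.json entry for such a trigger would look roughly like this (the schedule shown is illustrative; the path matches the endpoint polled below):

{
  "crons": [
    { "path": "/api/workflows/cron", "schedule": "0 0 * * *" }
  ]
}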

Browser-Based Local Cron

  • Client-side component running in user browsers
  • Polls /api/workflows/cron endpoint at configurable intervals
  • Works in both development and production environments
  • Configurable polling frequency (default: 60 seconds)
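
A minimal sketch of such a poller as a client component, assuming the environment variables documented below (the real component may differ):

"use client";
import { useEffect } from "react";

const FREQUENCY = Number(process.env.NEXT_PUBLIC_LOCAL_CRON_FREQUENCY ?? 60_000);

export function LocalCron() {
  useEffect(() => {
    if (process.env.NEXT_PUBLIC_ENABLE_LOCAL_CRON !== "true") return;
    const id = setInterval(() => {
      // timestamp query param busts intermediate caches (see Resilience Features)
      fetch(`/api/workflows/cron?t=${Date.now()}`, { cache: "no-store" }).catch(() => {
        // swallow transient errors; the next tick retries
      });
    }, FREQUENCY);
    return () => clearInterval(id);
  }, []);
  return null; // renders nothing; mounted once in the app layout
}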

Schedule Types

Simple Intervals

  • Seconds: 30s (every 30 seconds)
  • Minutes: 5m (every 5 minutes)
  • Hours: 2h (every 2 hours)
  • Days: 1d (daily)

Cron Expressions

  • Standard Cron: 0 9 * * 1-5 (weekdays at 9am)
  • Complex Patterns: */15 * * * * (every 15 minutes)
  • Custom Schedules: Full cron expression support
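
Shorthand intervals can be normalized to milliseconds before scheduling; a sketch of a hypothetical helper:

function intervalToMs(interval: string): number {
  const match = /^(\d+)([smhd])$/.exec(interval.trim());
  if (!match) throw new Error(`Unsupported interval: ${interval}`);
  const unitMs = { s: 1_000, m: 60_000, h: 3_600_000, d: 86_400_000 };
  return Number(match[1]) * unitMs[match[2] as keyof typeof unitMs];
}

// intervalToMs("30s") === 30_000; intervalToMs("2h") === 7_200_000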

Resilience Features

  • Exponential Backoff: Automatic retry with increasing delays
  • Cache Prevention: Timestamp-based cache busting
  • Error Recovery: Continues operation during temporary failures
  • Resource Optimization: Minimal API calls when browser is inactive
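
For illustration, exponential backoff of this kind typically doubles the delay on each failed attempt; a generic sketch:

async function withBackoff<T>(fn: () => Promise<T>, retries = 5, baseMs = 1_000): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // give up after the final retry
      // delays grow as 1s, 2s, 4s, 8s, ...
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** attempt));
    }
  }
}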

Credit System & Billing

Credit-Based Usage

  • Transparent Pricing: Each task type has a defined credit cost
  • Pre-execution Validation: Workflows validate credit requirements before running
  • Real-time Tracking: Live credit consumption monitoring
  • Usage Analytics: Detailed breakdown of credit usage by task type

Billing Integration

  • Stripe Integration: Secure payment processing
  • Flexible Plans: Multiple credit packages available
  • Auto-refill: Optional automatic credit top-up
  • Transaction History: Complete billing history and receipts
  • Usage Forecasting: Predict credit needs based on workflow schedules

Credit Costs by Task Type

  • High Cost (5 credits): Launch Browser
  • Medium Cost (3-4 credits): AI Extraction, Screenshots, File Downloads, Email Sending
  • Standard Cost (2 credits): Data extraction, Database queries, Navigation
  • Low Cost (1 credit): Simple operations, waits, property manipulation
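
Pre-execution validation then reduces to summing these per-task costs over the workflow; a sketch with a worked example:

function totalCreditCost(tasks: { credits: number }[]): number {
  return tasks.reduce((sum, task) => sum + task.credits, 0);
}

// Launch Browser (5) + Page to HTML (2) + Extract Data with AI (4) = 11 credits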

Security & Credentials

Authentication & Authorization

  • Clerk Integration: Secure user authentication and session management
  • Protected Routes: Middleware-based route protection
  • API Security: Secure API endpoints with proper authentication
  • User Isolation: Complete data separation between users

Credential Management

  • AES-256-CBC Encryption: Industry-standard encryption for stored credentials
  • Secure Storage: Database-encrypted credential storage
  • Access Control: User-specific credential access
  • Audit Trail: Complete history of credential usage
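
A typical AES-256-CBC round trip in Node looks like the following. The iv:ciphertext payload format and helper names are assumptions, not the app's actual code:

import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const ALGORITHM = "aes-256-cbc";
// ENCRYPTION_SECRET must decode to 32 bytes for AES-256 (see Configuration Notes)
const key = Buffer.from(process.env.ENCRYPTION_SECRET!, "hex");

export function encrypt(plain: string): string {
  const iv = randomBytes(16); // fresh IV per value, stored alongside the ciphertext
  const cipher = createCipheriv(ALGORITHM, key, iv);
  const data = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  return `${iv.toString("hex")}:${data.toString("hex")}`;
}

export function decrypt(payload: string): string {
  const [ivHex, dataHex] = payload.split(":");
  const decipher = createDecipheriv(ALGORITHM, key, Buffer.from(ivHex, "hex"));
  return Buffer.concat([decipher.update(Buffer.from(dataHex, "hex")), decipher.final()]).toString("utf8");
}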

Security Best Practices

  • Parameterized Queries: SQL injection prevention via Prisma
  • Environment Variables: Secure secret management
  • HTTPS Enforcement: Secure communication channels
  • Rate Limiting: Built-in protection against abuse

Analytics & Monitoring

Execution Analytics

  • Success/Failure Rates: Comprehensive execution statistics
  • Performance Metrics: Execution time and resource usage tracking
  • Credit Consumption: Detailed credit usage analytics
  • Trend Analysis: Historical data and usage patterns

Real-time Monitoring

  • Live Execution Tracking: Real-time workflow execution monitoring
  • Progress Indicators: Visual progress tracking for running workflows
  • Error Reporting: Immediate error notifications and detailed logs
  • Performance Alerts: Automatic alerts for unusual activity

Dashboard Features

  • Usage Overview: High-level statistics and trends
  • Workflow Performance: Individual workflow success rates
  • Credit Forecasting: Predictive analytics for credit usage
  • Export Capabilities: Data export for external analysis

Environment Variables

Required Variables

DATABASE_URL=postgresql://username:password@host:port/database
API_SECRET=your-secure-api-secret-for-cron-authentication
ENCRYPTION_SECRET=64-hex-character-key-for-credential-encryption
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_test_your-clerk-public-key
CLERK_SECRET_KEY=sk_test_your-clerk-secret-key
STRIPE_SECRET_KEY=sk_test_your-stripe-secret-key
STRIPE_WEBHOOK_SECRET=whsec_your-stripe-webhook-secret
OPENAI_API_KEY=sk-your-openai-api-key
RESEND_API_KEY=re_your-resend-api-key

Optional Variables

NEXT_PUBLIC_ENABLE_LOCAL_CRON=true
NEXT_PUBLIC_LOCAL_CRON_FREQUENCY=60000
NEXT_PUBLIC_APP_URL=https://your-domain.com

Configuration Notes

  • ENCRYPTION_SECRET: Must decode to a 32-byte AES-256 key (64 hex characters when hex-encoded)
  • API_SECRET: Used for secure communication between cron endpoints
  • Local Cron: Enable for development or enhanced scheduling reliability
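
Assuming a hex-encoded key, a suitable ENCRYPTION_SECRET can be generated with:

openssl rand -hex 32   # 32 random bytes, printed as 64 hex characters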

Screenshots

The repository includes screenshots of the following screens:

  • Login & Signup page
  • Home Page
  • Home Page with Chart
  • Workflow Page
  • Workflow Page with Actions
  • Workflow Creation Form
  • Workflow Scheduler Configuration
  • Workflow Editor
  • Workflow Editor Validation
  • Execution Monitor
  • All Workflows
  • Credential Page
  • Credential Creation Form
  • Billing Page
  • Billing Chart
  • Transaction History
  • Payment Invoice

Deployment

Vercel Deployment (Recommended)

  1. Repository Setup: Connect your GitHub repository to Vercel
  2. Environment Configuration: Add all required environment variables in Vercel dashboard
  3. Database Setup: Configure PostgreSQL database (Vercel Postgres recommended)
  4. Deployment: Deploy the application through Vercel interface
  5. Cron Configuration: Verify vercel.json cron configuration is active

Manual Deployment

  1. Build: npm run build
  2. Database: Run migrations in production environment
  3. Environment: Ensure all environment variables are properly set
  4. Start: npm start or use process manager like PM2

Post-Deployment Checklist

  • Verify database connections
  • Test authentication flow
  • Confirm Stripe webhook endpoints
  • Validate cron job execution
  • Test workflow creation and execution
  • Verify email functionality

Contributing

We welcome contributions to Scrape Flow! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature-name
  3. Make your changes with appropriate tests
  4. Ensure code follows existing style guidelines
  5. Submit a pull request with detailed description

Adding New Task Types

  1. Create task definition in lib/workflow/task/
  2. Add task to registry in lib/workflow/task/registry.tsx
  3. Implement executor in lib/workflow/executor/
  4. Add proper TypeScript types
  5. Include comprehensive tests
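
As a rough sketch, an executor following step 3 might look like this (the ExecutionEnvironment shape and executor signature here are hypothetical; consult the existing executors for the real contract):

// Hypothetical shapes for illustration only.
interface ExecutionEnvironment {
  getInput(name: string): string;
  setOutput(name: string, value: string): void;
  log: { info(msg: string): void; error(msg: string): void };
}

export async function WaitDelayExecutor(env: ExecutionEnvironment): Promise<boolean> {
  try {
    const ms = Number(env.getInput("Delay (ms)"));
    await new Promise((resolve) => setTimeout(resolve, ms));
    env.setOutput("Completed at", new Date().toISOString());
    return true; // signal success to the engine
  } catch (err) {
    env.log.error(String(err));
    return false; // signal failure; the engine handles recovery
  }
}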

Coding Standards

  • TypeScript: Strict mode enabled, proper type definitions required
  • Formatting: Prettier and ESLint configurations must be followed
  • Testing: Unit tests required for new features
  • Documentation: Update README and inline documentation

Bug Reports

Please include:

  • Detailed description of the issue
  • Steps to reproduce
  • Expected vs actual behavior
  • Environment details (Node.js version, browser, etc.)
  • Screenshots if applicable

Development Commands

# Database management
npx prisma migrate dev    # Run database migrations
npx prisma studio        # Open Prisma Studio
npx prisma generate      # Generate Prisma client

# Development
npm run dev             # Start development server
npm run build          # Build for production
npm run start          # Start production server

# Database queries (custom command)
npm run queries        # Run custom database queries

License: MIT
Maintainer: Harmik Lathiya
Support: Create an issue
