Scrape Flow is a comprehensive web scraping and automation platform that allows users to create, schedule, and manage sophisticated web scraping workflows with an intuitive visual interface. This application provides a powerful yet user-friendly way to extract data from websites, process it with AI, perform complex data transformations, and deliver results through multiple channels including webhooks, email, and database storage.
- Features
- Task Types
- Technology Stack
- Getting Started
- Architecture
- Workflow System
- Scheduling & Automation
- Credit System & Billing
- Security & Credentials
- Analytics & Monitoring
- Environment Variables
- Screenshots
- Deployment
- Contributing
- Visual Workflow Builder: Drag-and-drop interface for creating complex scraping workflows
- 22 Task Types: Comprehensive set of tasks across 7 categories for all automation needs
- AI-Powered Extraction: Intelligent data extraction using AI models
- Real-time Execution: Live workflow execution with detailed logging and monitoring
- Advanced Scheduling: Hybrid cron system supporting both simple and complex schedules
- Credit Management: Transparent usage-based billing with Stripe integration
- Secure Credential Storage: Encrypted storage for API keys and authentication data
- Multi-format Output: JSON, CSV, and custom data transformation support
- Conditional Logic: Dynamic workflow paths based on data conditions
- Loop Processing: Iterate over data sets with configurable loop controls
- Database Integration: Direct database queries and data storage
- Email Automation: Send rich HTML/text emails with attachments
- File Downloads: Automated file downloading and processing
- Screenshot Capture: Full-page and viewport screenshots with quality controls
- Table Extraction: Structured data extraction from HTML tables
- Data Transformation: Built-in data processing and formatting tools
- Scalable Architecture: Built for high-volume data processing
- Analytics Dashboard: Comprehensive execution statistics and credit usage tracking
- User Management: Multi-user support with Clerk authentication
- Rate Limiting: Built-in protection against abuse
- Webhook Integration: Real-time data delivery to external systems
- Error Handling: Robust error recovery and retry mechanisms
- Launch Browser (5 credits): Initialize browser and navigate to websites
- Navigate URL (2 credits): Navigate to specific URLs within workflows
- Page to HTML (2 credits): Extract full HTML content from web pages
- Wait for Element (1 credit): Wait for specific elements to appear/disappear
- Take Screenshot (3 credits): Capture full-page or viewport screenshots
- Click Element (3 credits): Interact with clickable elements on pages
- Fill Input (1 credit): Fill form inputs with dynamic data
- Scroll to Element (1 credit): Scroll to specific elements on pages
- Wait Delay (1 credit): Add controlled delays in workflow execution
- Extract Text from Element (2 credits): Extract text content using CSS selectors
- Extract Table Data (3 credits): Extract structured data from HTML tables
- Extract Data with AI (4 credits): AI-powered intelligent data extraction
- Read Property from JSON (1 credit): Extract specific properties from JSON data
- Add Property to JSON (1 credit): Add new properties to JSON objects
- Data Transform (2 credits): Advanced data transformation and formatting
- Conditional Logic (1 credit): Create dynamic workflow paths
- Loop (2 credits): Iterate over data collections with customizable controls
- Send Email (3 credits): Send HTML/text emails with attachment support
- Deliver via Webhook (1 credit): Send data to external APIs and services
- Download File (3 credits): Download and process files from URLs
- Database Query (2 credits): Execute SQL queries and store results
- Extract Data with AI (4 credits): Leverage AI for complex data extraction scenarios
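For orientation, here is how a few of these tasks might compose into a runnable workflow. The object shape below is purely illustrative (the app's actual definition schema is not shown here); the credit arithmetic follows the costs listed above:

```typescript
// Hypothetical workflow definition -- illustrates how task types and
// credit costs compose; the real schema may differ.
const exampleWorkflow = {
  name: "Product price monitor",
  tasks: [
    { type: "LAUNCH_BROWSER", credits: 5, inputs: { url: "https://example.com/products" } },
    { type: "PAGE_TO_HTML", credits: 2 },
    { type: "EXTRACT_DATA_WITH_AI", credits: 4, inputs: { prompt: "Extract product names and prices as JSON" } },
    { type: "DELIVER_VIA_WEBHOOK", credits: 1, inputs: { targetUrl: "https://api.example.com/ingest" } },
  ],
}; // Total cost: 5 + 2 + 4 + 1 = 12 credits per run
```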
- Framework: Next.js 14 (App Router)
- UI Components: Shadcn UI (built on Radix UI)
- Styling: Tailwind CSS
- State Management: TanStack Query (React Query)
- Forms: React Hook Form with Zod validation
- Data Visualization: Recharts
- Flow Editor: XYFlow (React Flow)
- API: Next.js Server Actions and API Routes
- Database ORM: Prisma
- Authentication: Clerk
- Payments: Stripe
- Scheduling: Custom cron scheduler with hybrid approach
- Web Scraping: Puppeteer/Chromium
- AI Integration: OpenAI
- Email Service: Resend
- Primary Database: PostgreSQL
- Schema Management: Prisma Migrations
- Query Builder: Custom query execution for Database Query tasks
- Hosting: Vercel
- Cron Jobs: Vercel Cron + Browser-based hybrid system
- File Storage: Built-in file download and processing
- Encryption: AES-256-CBC for credential storage
Experience Scrape Flow in action at our Live Demo
- Node.js (v18 or higher)
- PostgreSQL database
- Clerk account (for authentication)
- Stripe account (for payments)
- OpenAI API key (for AI features)
- Resend API key (for email functionality)
- Clone the repository:
```bash
git clone https://github.com/your-username/scrape-flow.git
cd scrape-flow
```
- Install dependencies:
```bash
npm install
```
- Set up environment variables: create a `.env` file in the root directory and add the required variables (see the Environment Variables section)
- Run database migrations:
```bash
npx prisma migrate dev
```
- Start the development server:
```bash
npm run dev
```
The application features a sophisticated workflow execution engine that handles:
- Task Registry: Dynamic loading and validation of all task types
- Execution Context: Isolated execution environments for each workflow run
- Credit Tracking: Real-time credit consumption monitoring
- Error Recovery: Automatic retry mechanisms and graceful error handling
- Progress Tracking: Live execution progress updates
- Workflow Definition: Users create workflows using the visual editor
- Validation: Workflows are validated for completeness and credit requirements
- Execution: Tasks are executed sequentially with data passing between them
- Monitoring: Real-time execution monitoring with detailed logging
- Results: Data is delivered through configured output channels
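A minimal sketch of that sequential loop, assuming hypothetical `ExecutionContext` and executor types rather than the engine's real internals:

```typescript
// Sketch of the sequential execution loop described above; names are
// illustrative, not the app's actual internals.
type ExecutionContext = { outputs: Record<string, unknown>; creditsConsumed: number };
type TaskExecutor = (ctx: ExecutionContext) => Promise<Record<string, unknown>>;

async function runWorkflow(
  tasks: { id: string; credits: number; execute: TaskExecutor }[],
): Promise<ExecutionContext> {
  const ctx: ExecutionContext = { outputs: {}, creditsConsumed: 0 };
  for (const task of tasks) {
    try {
      // Each task reads prior outputs from the shared context and
      // contributes its own results for downstream tasks.
      const result = await task.execute(ctx);
      ctx.outputs[task.id] = result;
      ctx.creditsConsumed += task.credits;
    } catch (err) {
      // Graceful failure: record the error and stop, preserving the
      // partial execution state for the audit trail.
      ctx.outputs[task.id] = { error: String(err) };
      break;
    }
  }
  return ctx;
}
```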
- Drag-and-Drop Interface: Intuitive task placement and connection
- Real-time Validation: Immediate feedback on workflow completeness
- Task Configuration: Detailed parameter configuration for each task
- Preview Mode: Test workflows without consuming credits
- Export/Import: Save and share workflow definitions
- Sequential Processing: Tasks execute in defined order
- Data Persistence: Workflow state maintained throughout execution
- Credit Management: Pre-execution credit validation and real-time tracking
- Error Handling: Graceful failure recovery with detailed error messages
- Execution History: Complete audit trail of all workflow runs
To overcome platform limitations, Scrape Flow implements a unique hybrid scheduling approach:
- Daily cron job configured in `vercel.json` (see the sketch below)
- Serves as a backup execution trigger
- Ensures workflows are checked at least once daily
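Such a daily backup trigger would be declared in `vercel.json` roughly as follows (the endpoint path comes from the poller described below; the midnight schedule is an assumption):

```json
{
  "crons": [
    {
      "path": "/api/workflows/cron",
      "schedule": "0 0 * * *"
    }
  ]
}
```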
- Client-side component running in user browsers (sketched below)
- Polls the `/api/workflows/cron` endpoint at configurable intervals
- Works in both development and production environments
- Configurable polling frequency (default: 60 seconds)
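A minimal sketch of such a poller as a Next.js client component (the component name and exact wiring are assumptions; the endpoint and environment variables match those documented under Environment Variables):

```typescript
"use client";
// Sketch of the browser-based cron poller. It periodically hits the
// cron endpoint so due workflows are triggered even between Vercel's
// daily runs.
import { useEffect } from "react";

export function LocalCronPoller() {
  useEffect(() => {
    if (process.env.NEXT_PUBLIC_ENABLE_LOCAL_CRON !== "true") return;
    const frequencyMs = Number(process.env.NEXT_PUBLIC_LOCAL_CRON_FREQUENCY ?? 60_000);
    const id = setInterval(() => {
      // Timestamp query param busts any intermediate caches.
      fetch(`/api/workflows/cron?t=${Date.now()}`).catch(() => {
        /* transient failures are tolerated; the next tick retries */
      });
    }, frequencyMs);
    return () => clearInterval(id);
  }, []);

  return null; // renders nothing; side-effect only
}
```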
- Seconds: `30s` (every 30 seconds)
- Minutes: `5m` (every 5 minutes)
- Hours: `2h` (every 2 hours)
- Days: `1d` (daily)
- Standard Cron: `0 9 * * 1-5` (weekdays at 9 AM)
- Complex Patterns: `*/15 * * * *` (every 15 minutes)
- Custom Schedules: full cron expression support (a parsing sketch follows)
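One way to interpret this hybrid syntax, as a sketch (the helper name and return shape are assumptions): simple suffix intervals map to fixed millisecond periods, and anything else falls through as a standard cron expression:

```typescript
// Hypothetical schedule parser for the hybrid syntax above.
function parseSchedule(expr: string): { kind: "interval"; ms: number } | { kind: "cron"; expr: string } {
  const match = /^(\d+)([smhd])$/.exec(expr.trim());
  if (match) {
    // Suffix intervals: 30s, 5m, 2h, 1d.
    const units: Record<string, number> = { s: 1_000, m: 60_000, h: 3_600_000, d: 86_400_000 };
    return { kind: "interval", ms: Number(match[1]) * units[match[2]] };
  }
  // Everything else is treated as a cron expression,
  // e.g. "0 9 * * 1-5" or "*/15 * * * *".
  return { kind: "cron", expr };
}
```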
- Exponential Backoff: Automatic retry with increasing delays
- Cache Prevention: Timestamp-based cache busting
- Error Recovery: Continues operation during temporary failures
- Resource Optimization: Minimal API calls when browser is inactive
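As a sketch of the first two items, assuming a hypothetical `fetchWithBackoff` helper: each failure doubles the wait, and a timestamp query parameter defeats caching:

```typescript
// Retry with exponential backoff and cache busting (illustrative).
async function fetchWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
  let delayMs = 1_000;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fetch(`${url}?t=${Date.now()}`); // cache-busting timestamp
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs = Math.min(delayMs * 2, 60_000); // cap the wait at one minute
    }
  }
}
```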
- Transparent Pricing: Each task type has a defined credit cost
- Pre-execution Validation: Workflows validate credit requirements before running
- Real-time Tracking: Live credit consumption monitoring
- Usage Analytics: Detailed breakdown of credit usage by task type
- Stripe Integration: Secure payment processing
- Flexible Plans: Multiple credit packages available
- Auto-refill: Optional automatic credit top-up
- Transaction History: Complete billing history and receipts
- Usage Forecasting: Predict credit needs based on workflow schedules
- High Cost (5 credits): Launch Browser
- Medium Cost (3-4 credits): AI Extraction, Screenshots, File Downloads, Email Sending
- Standard Cost (2 credits): Data extraction, Database queries, Navigation
- Low Cost (1 credit): Simple operations, waits, property manipulation
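Pre-execution validation then reduces to summing per-task costs against the user's balance; a sketch with a hypothetical registry (costs taken from the tiers above):

```typescript
// Hypothetical credit registry and pre-execution validation.
const TASK_CREDITS: Record<string, number> = {
  LAUNCH_BROWSER: 5,
  EXTRACT_DATA_WITH_AI: 4,
  TAKE_SCREENSHOT: 3,
  PAGE_TO_HTML: 2,
  WAIT_FOR_ELEMENT: 1,
  // ...one entry per task type
};

function validateCredits(taskTypes: string[], balance: number): { ok: boolean; required: number } {
  const required = taskTypes.reduce((sum, t) => sum + (TASK_CREDITS[t] ?? 0), 0);
  return { ok: balance >= required, required };
}
```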
- Clerk Integration: Secure user authentication and session management
- Protected Routes: Middleware-based route protection
- API Security: Secure API endpoints with proper authentication
- User Isolation: Complete data separation between users
- AES-256-CBC Encryption: Industry-standard symmetric encryption for stored credentials (sketched below)
- Secure Storage: Database-encrypted credential storage
- Access Control: User-specific credential access
- Audit Trail: Complete history of credential usage
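A minimal sketch of AES-256-CBC credential encryption with Node's built-in `crypto` module, assuming `ENCRYPTION_SECRET` is the 32-byte hex key described under Environment Variables (the helper names are assumptions):

```typescript
import crypto from "crypto";

const ALGORITHM = "aes-256-cbc";
// Assumes ENCRYPTION_SECRET is 64 hex characters (32 bytes).
const key = Buffer.from(process.env.ENCRYPTION_SECRET!, "hex");

export function encryptCredential(plaintext: string): string {
  // A fresh random IV per credential, stored alongside the ciphertext.
  const iv = crypto.randomBytes(16);
  const cipher = crypto.createCipheriv(ALGORITHM, key, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return `${iv.toString("hex")}:${encrypted.toString("hex")}`;
}

export function decryptCredential(stored: string): string {
  const [ivHex, dataHex] = stored.split(":");
  const decipher = crypto.createDecipheriv(ALGORITHM, key, Buffer.from(ivHex, "hex"));
  return Buffer.concat([decipher.update(Buffer.from(dataHex, "hex")), decipher.final()]).toString("utf8");
}
```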
- Parameterized Queries: SQL injection prevention via Prisma
- Environment Variables: Secure secret management
- HTTPS Enforcement: Secure communication channels
- Rate Limiting: Built-in protection against abuse
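For the first item, Prisma's `$queryRaw` tagged template sends interpolated values as bound parameters rather than splicing them into the SQL string; a small example with a hypothetical table and column:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// userId is passed as a bound parameter, preventing SQL injection.
async function findWorkflows(userId: string) {
  return prisma.$queryRaw`SELECT id, name FROM "Workflow" WHERE "userId" = ${userId}`;
}
```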
- Success/Failure Rates: Comprehensive execution statistics
- Performance Metrics: Execution time and resource usage tracking
- Credit Consumption: Detailed credit usage analytics
- Trend Analysis: Historical data and usage patterns
- Live Execution Tracking: Real-time workflow execution monitoring
- Progress Indicators: Visual progress tracking for running workflows
- Error Reporting: Immediate error notifications and detailed logs
- Performance Alerts: Automatic alerts for unusual activity
- Usage Overview: High-level statistics and trends
- Workflow Performance: Individual workflow success rates
- Credit Forecasting: Predictive analytics for credit usage
- Export Capabilities: Data export for external analysis
```env
DATABASE_URL=postgresql://username:password@host:port/database
API_SECRET=your-secure-api-secret-for-cron-authentication
ENCRYPTION_SECRET=64-hex-character-key-for-credential-encryption
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_test_your-clerk-public-key
CLERK_SECRET_KEY=sk_test_your-clerk-secret-key
STRIPE_SECRET_KEY=sk_test_your-stripe-secret-key
STRIPE_WEBHOOK_SECRET=whsec_your-stripe-webhook-secret
OPENAI_API_KEY=sk-your-openai-api-key
RESEND_API_KEY=re_your-resend-api-key
NEXT_PUBLIC_ENABLE_LOCAL_CRON=true
NEXT_PUBLIC_LOCAL_CRON_FREQUENCY=60000
NEXT_PUBLIC_APP_URL=https://your-domain.com
```
- ENCRYPTION_SECRET: Must decode to exactly 32 bytes, i.e. 64 hex characters
- API_SECRET: Used for secure communication between cron endpoints
- Local Cron: Enable for development or enhanced scheduling reliability
- Repository Setup: Connect your GitHub repository to Vercel
- Environment Configuration: Add all required environment variables in Vercel dashboard
- Database Setup: Configure PostgreSQL database (Vercel Postgres recommended)
- Deployment: Deploy the application through Vercel interface
- Cron Configuration: Verify that the `vercel.json` cron configuration is active
- Build: `npm run build`
- Database: Run migrations in the production environment
- Environment: Ensure all environment variables are properly set
- Start: `npm start`, or use a process manager such as PM2
- Verify database connections
- Test authentication flow
- Confirm Stripe webhook endpoints
- Validate cron job execution
- Test workflow creation and execution
- Verify email functionality
We welcome contributions to Scrape Flow! Please follow these guidelines:
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make your changes with appropriate tests
- Ensure code follows existing style guidelines
- Submit a pull request with detailed description
- Create task definition in `lib/workflow/task/`
- Add task to registry in `lib/workflow/task/registry.tsx`
- Implement executor in `lib/workflow/executor/`
- Add proper TypeScript types
- Include comprehensive tests
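As a rough sketch of steps 1 and 3 (the interfaces here are illustrative; consult the existing entries in `lib/workflow/task/` and `lib/workflow/executor/` for the real types):

```typescript
// Hypothetical task definition shape for lib/workflow/task/.
export const WaitDelayTask = {
  type: "WAIT_DELAY",
  label: "Wait Delay",
  credits: 1,
  inputs: [{ name: "Duration (ms)", type: "NUMBER", required: true }],
  outputs: [],
};

// Matching executor for lib/workflow/executor/; the environment
// interface is an assumption standing in for the engine's real one.
export async function WaitDelayExecutor(environment: {
  getInput(name: string): string;
  log(msg: string): void;
}): Promise<boolean> {
  const duration = Number(environment.getInput("Duration (ms)"));
  environment.log(`Waiting ${duration}ms`);
  await new Promise((resolve) => setTimeout(resolve, duration));
  return true; // signal success to the execution engine
}
```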
- TypeScript: Strict mode enabled, proper type definitions required
- Formatting: Prettier and ESLint configurations must be followed
- Testing: Unit tests required for new features
- Documentation: Update README and inline documentation
Please include:
- Detailed description of the issue
- Steps to reproduce
- Expected vs actual behavior
- Environment details (Node.js version, browser, etc.)
- Screenshots if applicable
```bash
# Database management
npx prisma migrate dev   # Run database migrations
npx prisma studio        # Open Prisma Studio
npx prisma generate      # Generate Prisma client

# Development
npm run dev              # Start development server
npm run build            # Build for production
npm run start            # Start production server

# Database queries (custom command)
npm run queries          # Run custom database queries
```
License: MIT License
Maintainer: Harmik Lathiya
Support: Create an issue