A demonstration project showing how to build an ethical web scraping application using Puppeteer in a Docker container, deployed on DigitalOcean App Platform.
- Search through Project Gutenberg's vast library of public domain books
- View detailed book information including:
  - Title and author
  - Publication date
  - Download statistics
  - Available formats
  - Subject categories
- Implements proper rate limiting (1 request per second)
- Respects robots.txt guidelines
- Clean, responsive UI built with Bootstrap
- Containerized with Docker
- Ready for deployment on DigitalOcean App Platform
- Node.js
- Express.js
- Puppeteer for web scraping
- EJS for templating
- Bootstrap for UI
- Docker for containerization
- Ethical Web Scraping
  - Proper rate limiting (1 request/second)
  - Descriptive User-Agent
  - Respect for robots.txt
  - Error handling and retries
- Docker Best Practices
  - Multi-stage builds
  - Non-root user
  - Security considerations
  - Resource optimization
- Code Organization
  - Modular architecture
  - Service-based structure
  - Clean separation of concerns
  - Error handling
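The 1 request/second limit listed above can be sketched as a small scheduler that reserves one start slot per interval. This is a minimal illustration, not the project's actual implementation: `createRateLimiter`, `defaultSleep`, and the injectable `now`/`sleep` parameters are hypothetical names introduced here.

```javascript
// Hypothetical helper (not from the repository): spaces out task starts so
// that at most one task begins per `intervalMs`.
// `now` and `sleep` are injectable so the timing logic can be tested
// without real delays.
function createRateLimiter(intervalMs = 1000, now = Date.now, sleep = defaultSleep) {
  let nextSlot = 0; // earliest timestamp (ms) at which the next task may start

  return async function schedule(task) {
    const current = now();
    const startAt = Math.max(current, nextSlot);
    nextSlot = startAt + intervalMs; // reserve the slot before waiting
    if (startAt > current) {
      await sleep(startAt - current);
    }
    return task();
  };
}

// Default sleep based on setTimeout.
function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

A scraper would wrap each outgoing page request in the returned function, for example `await limited(() => page.goto(url))`, where `limited` is the result of `createRateLimiter(1000)` and `page` is assumed to be a Puppeteer page. Reserving the slot before sleeping keeps concurrent callers from sharing the same start time.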
- Clone the repository:
  git clone https://github.com/yourusername/gutenberg-scraper.git && cd gutenberg-scraper
- Install dependencies:
  npm install
- Run in development mode:
  npm run dev
- Visit http://localhost:8080
- Build the container:
  docker build -t gutenberg-scraper -f dockerfiles/Dockerfile .
- Run the container:
  docker run -p 8080:8080 gutenberg-scraper
- Fork this repository to your GitHub account
- In the DigitalOcean Console:
  - Create a new app
  - Select your forked repository
  - Choose the Docker source
  - Configure environment variables if needed
  - Deploy!
- PORT: Application port (default: 8080)
- NODE_ENV: Environment setting (development/production)
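As a rough sketch of how an application might read these two variables, the snippet below parses `PORT` with the documented default of 8080 and reads `NODE_ENV`. The `loadConfig` helper and the `isProduction` field are hypothetical, and falling back to `development` when `NODE_ENV` is unset is an assumption, since no default is documented for it.

```javascript
// Hypothetical helper (not from the repository): reads PORT and NODE_ENV
// from an environment object, defaulting to process.env.
function loadConfig(env = process.env) {
  // PORT defaults to 8080, matching the documented default.
  const port = Number.parseInt(env.PORT ?? '8080', 10);
  if (Number.isNaN(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  // Assumption: treat an unset NODE_ENV as "development".
  const nodeEnv = env.NODE_ENV ?? 'development';
  return { port, nodeEnv, isProduction: nodeEnv === 'production' };
}
```

Validating the port at startup surfaces a misconfigured deployment immediately instead of at the first failed bind.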
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project serves as an example for several important concepts:
- Web Scraping Ethics
  - Rate limiting implementation
  - Proper user agent identification
  - Respecting website terms of service
- Docker Containerization
  - Multi-stage builds
  - Security considerations
  - Resource optimization
- Cloud Deployment
  - DigitalOcean App Platform setup
  - Container orchestration
  - Environment configuration
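To make the scraping-ethics point concrete, one way to honor robots.txt is to check each path against the site's `Disallow` rules before requesting it. The sketch below is deliberately simplified and is not the project's actual code: `parseDisallowRules` and `isPathAllowed` are hypothetical helpers that only consider groups addressed to `User-agent: *`, and they ignore `Allow` lines and wildcard patterns.

```javascript
// Hypothetical helper (not from the repository): collects Disallow prefixes
// from groups that apply to all crawlers ("User-agent: *").
// Simplification: Allow rules and wildcard patterns are not handled.
function parseDisallowRules(robotsTxt) {
  const rules = [];
  let appliesToAll = false; // does the current group include "User-agent: *"?
  let inAgentRun = false;   // was the previous directive a User-agent line?

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop trailing comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      if (!inAgentRun) appliesToAll = false; // a new group starts here
      if (value === '*') appliesToAll = true;
      inAgentRun = true;
    } else {
      inAgentRun = false;
      // An empty Disallow value means "nothing is disallowed".
      if (field === 'disallow' && appliesToAll && value !== '') {
        rules.push(value);
      }
    }
  }
  return rules;
}

// A path is allowed when no Disallow prefix matches its beginning.
function isPathAllowed(path, disallowRules) {
  return !disallowRules.some((prefix) => path.startsWith(prefix));
}
```

In practice the rules would be parsed once from the fetched robots.txt and consulted before every navigation. A production scraper would likely rely on a maintained parser that implements the full Robots Exclusion Protocol.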
ISC License - See LICENSE file for details