A web-based PDF downloader that automatically searches Google for PDF documents and downloads them, based on customizable keywords and research fields.
- Web Interface: Modern, responsive web application
- Multiple Research Fields: Pre-configured keywords for cybersecurity, AI, and more
- Customizable: Add your own keywords and configure download settings
- Real-time Status: Monitor download progress and results
- Anti-Detection: Built-in mechanisms to avoid Google's bot detection
- Organized Storage: Automatic file organization by field and keyword
- Configurable: Environment-based configuration for easy deployment
pdf-downloader/
├── app.py                 # Flask web application
├── config.py              # Configuration management
├── requirements.txt       # Python dependencies
├── README.md              # This file
├── env_example.txt        # Environment variables example
├── google_Download.py     # Original script (for reference)
├── src/
│   ├── __init__.py
│   └── pdf_downloader.py  # Core PDF downloader class
└── templates/
    └── index.html         # Web interface
- Python 3.8 or higher
- Chrome browser installed
- Git (for cloning)
- Clone the repository
    git clone https://github.com/dreamjet31/pdf-research-google-downloade
    cd pdf-research-google-downloader
- Create and activate a virtual environment
    python -m venv venv
    # On Windows
    venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
- Install dependencies
    pip install -r requirements.txt
- Configure environment variables
    # Copy the example file
    cp env_example.txt .env
    # Edit .env with your settings
    nano .env
- Run the application
    python app.py
- Access the web interface: open your browser and go to http://localhost:5000
Copy env_example.txt to .env and customize the settings:
# Application Settings
DEBUG=False
SECRET_KEY=your-secret-key-change-this-in-production
# Download Settings
MAX_PDF_PER_KEYWORD=200
MAX_PAGES_PER_SEARCH=3
# Timing Settings (in seconds)
MIN_SLEEP_TIME=2
MAX_SLEEP_TIME=5
PAGE_LOAD_TIMEOUT=10
REQUEST_TIMEOUT=15
# Browser Settings
HEADLESS_MODE=False
USER_AGENT_ROTATION=True
# File Storage
BASE_DOWNLOAD_DIR=downloads
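As an illustration of how these variables could be consumed, here is a minimal sketch of an environment-based loader. It is not the project's actual config.py: the helpers `env_bool`, `env_int`, and `load_settings` are hypothetical names, and the defaults simply mirror the example values above.

```python
import os


def env_bool(name: str, default: bool, environ=os.environ) -> bool:
    """Read a boolean flag such as DEBUG=False or HEADLESS_MODE=True."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int, environ=os.environ) -> int:
    """Read an integer setting, falling back to the default if unset or invalid."""
    try:
        return int(environ.get(name, default))
    except ValueError:
        return default


def load_settings(environ=os.environ) -> dict:
    """Collect the settings listed in env_example.txt into a plain dict."""
    settings = {
        "DEBUG": env_bool("DEBUG", False, environ),
        "SECRET_KEY": environ.get("SECRET_KEY", ""),
        "MAX_PDF_PER_KEYWORD": env_int("MAX_PDF_PER_KEYWORD", 200, environ),
        "MAX_PAGES_PER_SEARCH": env_int("MAX_PAGES_PER_SEARCH", 3, environ),
        "MIN_SLEEP_TIME": env_int("MIN_SLEEP_TIME", 2, environ),
        "MAX_SLEEP_TIME": env_int("MAX_SLEEP_TIME", 5, environ),
        "PAGE_LOAD_TIMEOUT": env_int("PAGE_LOAD_TIMEOUT", 10, environ),
        "REQUEST_TIMEOUT": env_int("REQUEST_TIMEOUT", 15, environ),
        "HEADLESS_MODE": env_bool("HEADLESS_MODE", False, environ),
        "USER_AGENT_ROTATION": env_bool("USER_AGENT_ROTATION", True, environ),
        "BASE_DOWNLOAD_DIR": environ.get("BASE_DOWNLOAD_DIR", "downloads"),
    }
    # Guard against an inverted delay range, which would make random delays ill-defined.
    if settings["MIN_SLEEP_TIME"] > settings["MAX_SLEEP_TIME"]:
        raise ValueError("MIN_SLEEP_TIME must not exceed MAX_SLEEP_TIME")
    return settings
```

Passing a plain dict instead of `os.environ` makes it easy to try different settings without touching the real environment.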
You can add custom keywords in two ways:
- Through the web interface: Use the "Custom Keywords" section
- In the configuration: Edit config.py and add them to DEFAULT_FIELDS_KEYWORDS
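The exact shape of DEFAULT_FIELDS_KEYWORDS is not documented here. The sketch below assumes it is a dict that maps each field name to a list of keyword strings, which is consistent with how `field_name` and `keywords` are passed in the programmatic example. The field contents are illustrative and `add_keywords` is a hypothetical helper.

```python
# Assumed shape (not confirmed by the project): field name -> list of search keywords.
DEFAULT_FIELDS_KEYWORDS = {
    "cybersecurity": ["network security", "data protection"],
}


def add_keywords(fields: dict, field_name: str, keywords: list) -> dict:
    """Add keywords to a field, creating the field if needed and skipping duplicates."""
    existing = fields.setdefault(field_name, [])
    for keyword in keywords:
        cleaned = keyword.strip()
        if cleaned and cleaned not in existing:
            existing.append(cleaned)
    return fields


# Create a new field, then extend an existing one.
add_keywords(DEFAULT_FIELDS_KEYWORDS, "custom", ["machine learning security"])
add_keywords(DEFAULT_FIELDS_KEYWORDS, "cybersecurity", ["data protection", "zero trust"])
```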
- Select Research Fields: Choose from pre-configured fields or add custom keywords
- Configure Settings: Adjust max PDFs per keyword and search pages
- Start Download: Click "Start Download" to begin the process
- Monitor Progress: Watch real-time status updates
- View Results: See download statistics and results
- GET / - Main web interface
- GET /api/fields - Get available fields and keywords
- POST /api/start-download - Start download process
- GET /api/download-status - Get current download status
- GET /api/downloads - List downloaded files
- GET /api/config - Get current configuration
- GET /api/health - Health check
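A small client for these endpoints might look like the sketch below, using only the standard library. The request and response schemas are not documented here, so the JSON payload key `fields` is a hypothetical placeholder, and the code assumes the API accepts and returns JSON. `build_request`, `call_api`, and `run_example` are illustrative names.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:5000"  # default address from the setup steps


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = BASE_URL + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call_api(path: str, payload: Optional[dict] = None, timeout: float = 15.0):
    """Send the request and decode the response, assuming a JSON body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_example() -> None:
    """Call a few endpoints in sequence; requires the app to be running."""
    print(call_api("/api/health"))
    # Hypothetical payload: the real field names may differ.
    print(call_api("/api/start-download", {"fields": ["cybersecurity"]}))
    print(call_api("/api/download-status"))
```

With the application running locally, calling `run_example()` exercises the health, start, and status endpoints in order.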
from src.pdf_downloader import PDFDownloader
from config import Config
# Initialize downloader
downloader = PDFDownloader()
# Download for specific keywords
result = downloader.download_single_keyword(
    keyword="machine learning security",
    field_name="custom",
    max_pdfs=50
)
# Download for entire field
results = downloader.download_pdfs_for_field(
    field_name="cybersecurity",
    keywords=["network security", "data protection"],
    max_pdfs_per_keyword=100
)
python app.py
- Using Gunicorn
    pip install gunicorn
    gunicorn -w 4 -b 0.0.0.0:5000 app:app
- Using Docker
    FROM python:3.9-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    COPY . .
    EXPOSE 5000
    CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "app:app"]
- Cloud Deployment
  - Heroku: Add a Procfile with web: gunicorn app:app
  - AWS: Use Elastic Beanstalk or EC2
  - Google Cloud: Use App Engine or Compute Engine
  - Azure: Use App Service
DEBUG=False
SECRET_KEY=<generate-secure-random-key>
HEADLESS_MODE=True
USER_AGENT_ROTATION=True
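One way to produce the SECRET_KEY value is Python's standard `secrets` module. This is a minimal sketch; `generate_secret_key` is a hypothetical helper name, and the 32-byte length is an assumption rather than a project requirement.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string (two hex characters per byte) for SECRET_KEY."""
    return secrets.token_hex(num_bytes)


# Print a line that can be pasted into .env.
print(f"SECRET_KEY={generate_secret_key()}")
```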
- Rate Limiting: The tool includes random delays to avoid detection
- User Agent Rotation: Automatically rotates user agents
- Anti-Detection: Built-in mechanisms to avoid Google's bot detection
- Logging: Comprehensive logging for monitoring and debugging
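How the tool implements its random delays is not shown here. The sketch below assumes the delay is drawn uniformly between MIN_SLEEP_TIME and MAX_SLEEP_TIME from the configuration; `random_delay` and `polite_pause` are hypothetical names.

```python
import random
import time

# Example values from the configuration section (seconds).
MIN_SLEEP_TIME = 2.0
MAX_SLEEP_TIME = 5.0


def random_delay(min_s: float = MIN_SLEEP_TIME, max_s: float = MAX_SLEEP_TIME) -> float:
    """Pick a delay uniformly at random within the configured range."""
    if min_s > max_s:
        raise ValueError("min_s must not exceed max_s")
    return random.uniform(min_s, max_s)


def polite_pause(sleep=time.sleep) -> float:
    """Sleep for a random delay between requests and return the delay used."""
    delay = random_delay()
    sleep(delay)
    return delay
```

Accepting the `sleep` function as a parameter lets the pause be exercised in tests without actually waiting.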
- Chrome Driver Issues
    # Update Chrome browser
    # The tool automatically downloads the correct ChromeDriver version
- Robot Detection
  - The tool will pause and wait for manual intervention
  - Complete the CAPTCHA and press Enter to continue
- Permission Errors
    # Ensure write permissions to download directory
    chmod 755 downloads/
- Memory Issues
  - Reduce MAX_PDF_PER_KEYWORD in configuration
  - Use HEADLESS_MODE=True for better performance
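For permission errors, a quick check from Python can confirm that the download directory is usable before starting a long run. The sketch below assumes a layout of base directory, then field, then keyword, based on the "organized by field and keyword" description; the actual folder naming used by the tool may differ, and `safe_name` and `ensure_keyword_dir` are hypothetical helpers.

```python
import os
import re
from pathlib import Path


def safe_name(text: str) -> str:
    """Turn a field or keyword into a filesystem-friendly folder name."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", text.strip()).strip("_")
    return cleaned or "unnamed"


def ensure_keyword_dir(base_dir: str, field: str, keyword: str) -> Path:
    """Create base_dir/field/keyword if missing and verify that it is writable."""
    target = Path(base_dir) / safe_name(field) / safe_name(keyword)
    target.mkdir(parents=True, exist_ok=True)
    if not os.access(target, os.W_OK):
        raise PermissionError(f"No write permission for {target}")
    return target
```

For example, `ensure_keyword_dir("downloads", "cybersecurity", "network security")` creates and returns `downloads/cybersecurity/network_security`.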
Check the console output for detailed logs. The application provides real-time status updates through the web interface.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This tool is for educational and research purposes only. Please respect website terms of service and robots.txt files. The authors are not responsible for any misuse of this software.
For issues and questions:
- Check the troubleshooting section
- Review the logs
- Create an issue on GitHub
- Contact the maintainers
Happy Researching!