PDF Research Downloader

A web-based tool that searches Google for PDF documents matching customizable keywords and research fields, then downloads the results automatically.

Features

🚀 Web Interface: Modern, responsive web application
📚 Multiple Research Fields: Pre-configured keywords for cybersecurity, AI, and more
🔧 Customizable: Add your own keywords and configure download settings
📊 Real-time Status: Monitor download progress and results
🛡️ Anti-Detection: Built-in mechanisms to avoid Google's bot detection
📁 Organized Storage: Automatic file organization by field and keyword
⚙️ Configurable: Environment-based configuration for easy deployment

Project Structure

pdf-downloader/
├── app.py                 # Flask web application
├── config.py             # Configuration management
├── requirements.txt      # Python dependencies
├── README.md            # This file
├── env_example.txt      # Environment variables example
├── google_Download.py   # Original script (for reference)
├── src/
│   ├── __init__.py
│   └── pdf_downloader.py # Core PDF downloader class
└── templates/
    └── index.html       # Web interface

Installation

Prerequisites

- Python 3.8 or higher
- Chrome browser installed
- Git (for cloning)

Setup

Clone the repository

git clone https://github.com/dreamjet31/pdf-research-google-downloade
cd pdf-research-google-downloader

Create and activate a virtual environment

python -m venv venv

# On Windows
venv\Scripts\activate

# On macOS/Linux
source venv/bin/activate

Install dependencies
```
pip install -r requirements.txt
```

Configure environment variables

# Copy the example file
cp env_example.txt .env

# Edit .env with your settings
nano .env

Run the application
```
python app.py
```
Access the web interface

Open your browser and go to http://localhost:5000

Configuration

Environment Variables

Copy env_example.txt to .env and customize the settings:

# Application Settings
DEBUG=False
SECRET_KEY=your-secret-key-change-this-in-production

# Download Settings
MAX_PDF_PER_KEYWORD=200
MAX_PAGES_PER_SEARCH=3

# Timing Settings (in seconds)
MIN_SLEEP_TIME=2
MAX_SLEEP_TIME=5
PAGE_LOAD_TIMEOUT=10
REQUEST_TIMEOUT=15

# Browser Settings
HEADLESS_MODE=False
USER_AGENT_ROTATION=True

# File Storage
BASE_DOWNLOAD_DIR=downloads
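
As a rough illustration of how these variables could be read at startup, here is a minimal sketch using only the standard library. The `Settings` dataclass and the `load_settings` helper are hypothetical names introduced for this example; the project's actual config.py may be organized differently.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name, default):
    """Interpret 'True'/'False'-style strings as booleans."""
    return env.get(name, str(default)).strip().lower() in ("1", "true", "yes")


@dataclass
class Settings:
    debug: bool
    max_pdf_per_keyword: int
    max_pages_per_search: int
    min_sleep_time: float
    max_sleep_time: float
    headless_mode: bool
    base_download_dir: str


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    settings = Settings(
        debug=_env_bool(env, "DEBUG", False),
        max_pdf_per_keyword=int(env.get("MAX_PDF_PER_KEYWORD", 200)),
        max_pages_per_search=int(env.get("MAX_PAGES_PER_SEARCH", 3)),
        min_sleep_time=float(env.get("MIN_SLEEP_TIME", 2)),
        max_sleep_time=float(env.get("MAX_SLEEP_TIME", 5)),
        headless_mode=_env_bool(env, "HEADLESS_MODE", False),
        base_download_dir=env.get("BASE_DOWNLOAD_DIR", "downloads"),
    )
    # A reversed range would make the random delay between requests meaningless.
    if settings.min_sleep_time > settings.max_sleep_time:
        raise ValueError("MIN_SLEEP_TIME must not exceed MAX_SLEEP_TIME")
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment.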

Adding Custom Keywords

You can add custom keywords in two ways:

- Through the web interface: Use the "Custom Keywords" section
- In the configuration: Edit config.py and add to DEFAULT_FIELDS_KEYWORDS
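
The exact shape of DEFAULT_FIELDS_KEYWORDS is not shown in this README. Assuming it is a dictionary that maps a field name to a list of keyword strings, adding entries could look like the sketch below. The `add_keywords` helper is hypothetical and the values are placeholders.

```python
# Assumed shape: field name -> list of search keywords (placeholder values).
DEFAULT_FIELDS_KEYWORDS = {
    "cybersecurity": ["network security", "data protection"],
}


def add_keywords(fields, field_name, keywords):
    """Add keywords to a field, creating it if needed and skipping duplicates."""
    existing = fields.setdefault(field_name, [])
    for keyword in keywords:
        cleaned = keyword.strip()
        if cleaned and cleaned not in existing:
            existing.append(cleaned)
    return fields


# Register a new field with one keyword.
add_keywords(DEFAULT_FIELDS_KEYWORDS, "custom", ["machine learning security"])
```

Stripping whitespace and skipping duplicates keeps the same search from running twice for one field.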

Usage

Web Interface

- Select Research Fields: Choose from pre-configured fields or add custom keywords
- Configure Settings: Adjust max PDFs per keyword and search pages
- Start Download: Click "Start Download" to begin the process
- Monitor Progress: Watch real-time status updates
- View Results: See download statistics and results

API Endpoints

GET / - Main web interface
GET /api/fields - Get available fields and keywords
POST /api/start-download - Start download process
GET /api/download-status - Get current download status
GET /api/downloads - List downloaded files
GET /api/config - Get current configuration
GET /api/health - Health check
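
The endpoints above can also be called from a script. The sketch below uses only the standard library; `build_request` and `call` are hypothetical helper names, and the request and response bodies are assumed to be JSON, which this README does not confirm.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default address from the Setup section


def build_request(path, payload=None):
    """Create a GET request, or a JSON POST request when a payload is given."""
    url = BASE_URL + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call(path, payload=None, timeout=15):
    """Send the request and decode the response body as JSON (assumed format)."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call("/api/health")` or `call("/api/download-status")` would return the decoded response. The body expected by POST /api/start-download is not documented here, so any payload passed to `call` should be checked against app.py first.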

Programmatic Usage

from src.pdf_downloader import PDFDownloader
from config import Config

# Initialize downloader
downloader = PDFDownloader()

# Download PDFs for a single keyword
result = downloader.download_single_keyword(
    keyword="machine learning security",
    field_name="custom",
    max_pdfs=50
)

# Download PDFs for every keyword in a field
results = downloader.download_pdfs_for_field(
    field_name="cybersecurity",
    keywords=["network security", "data protection"],
    max_pdfs_per_keyword=100
)

Deployment

Local Development

python app.py

Production Deployment

Using Gunicorn

pip install gunicorn
gunicorn -w 4 -b 0.0.0.0:5000 app:app

Using Docker

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 5000

CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "app:app"]

Cloud Deployment
- Heroku: Add a Procfile containing web: gunicorn app:app
- AWS: Use Elastic Beanstalk or EC2
- Google Cloud: Use App Engine or Compute Engine
- Azure: Use App Service

Environment Variables for Production

DEBUG=False
SECRET_KEY=<generate-secure-random-key>
HEADLESS_MODE=True
USER_AGENT_ROTATION=True
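
One way to produce a value for SECRET_KEY is Python's standard `secrets` module. This is a small sketch, and `generate_secret_key` is a hypothetical helper name; any cryptographically secure random generator would serve the same purpose.

```python
import secrets


def generate_secret_key(num_bytes=32):
    """Return a random hex string (two characters per byte) for SECRET_KEY."""
    return secrets.token_hex(num_bytes)


# Print a line that can be pasted into the .env file.
print(f"SECRET_KEY={generate_secret_key()}")
```

Generate the key once per deployment and keep it out of version control.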

Security Considerations

⚠️ Rate Limiting: The tool includes random delays to avoid detection
🔒 User Agent Rotation: Automatically rotates user agents
🛡️ Anti-Detection: Built-in mechanisms to avoid Google's bot detection
📝 Logging: Comprehensive logging for monitoring and debugging
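
This README does not show how the delays or the rotation are implemented. As a minimal sketch of the general idea, assuming the delay is drawn uniformly between MIN_SLEEP_TIME and MAX_SLEEP_TIME and one user agent is picked at random per request, it could look like this. All function names and the user agent strings are placeholders.

```python
import random
import time

# Placeholder values; a real pool would hold full browser user agent strings.
USER_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
    "ExampleBrowser/3.0",
]


def random_delay(min_seconds=2.0, max_seconds=5.0):
    """Pick a delay uniformly between the two bounds."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    return random.uniform(min_seconds, max_seconds)


def polite_pause(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random delay and return the number of seconds waited."""
    delay = random_delay(min_seconds, max_seconds)
    sleep(delay)
    return delay


def pick_user_agent(agents=USER_AGENTS):
    """Choose one user agent at random for the next request."""
    return random.choice(agents)
```

The `sleep` parameter exists so the pause can be replaced with a no-op in tests.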

Troubleshooting

Common Issues

Chrome Driver Issues

- Update Chrome to the latest version
- The tool automatically downloads the matching ChromeDriver version

Robot Detection
- The tool will pause and wait for manual intervention
- Complete the CAPTCHA and press Enter to continue

Permission Errors

# Ensure write permissions to download directory
chmod 755 downloads/
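
The same check can be made from Python before starting a long download run. This is a small sketch with a hypothetical helper name, `ensure_writable_dir`; it only reports whether the current user can write to the directory and does not change permissions.

```python
import os


def ensure_writable_dir(path):
    """Create the directory if missing and report whether it is writable."""
    os.makedirs(path, exist_ok=True)
    # W_OK: files can be created; X_OK: the directory can be entered.
    return os.access(path, os.W_OK | os.X_OK)
```

For example, `ensure_writable_dir("downloads")` returns a boolean that can be checked before any downloads begin.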

Memory Issues
- Reduce MAX_PDF_PER_KEYWORD in configuration
- Use HEADLESS_MODE=True for better performance

Logs

Check the console output for detailed logs. The application provides real-time status updates through the web interface.

Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is for educational and research purposes only. Please respect website terms of service and robots.txt files. The authors are not responsible for any misuse of this software.

Support

For issues and questions:

- Check the troubleshooting section
- Review the logs
- Create an issue on GitHub
- Contact the maintainers

Happy Researching! 📚🔍
