s24hira/pdf-watermark-remover

 
 

PDF Watermark Remover

Removes 'RETRACTED' watermarks from academic PDF articles.

This tool provides a web interface and a command-line utility for removing watermarks from PDF files. It offers three aggressivity levels. Higher levels remove more and may alter the final document more, but images and photos embedded in the PDF are always preserved.

Features

  • Web Interface: An easy-to-use interface to upload and clean PDFs.
  • Command-Line Interface: For batch processing and integration into workflows.
  • Multiple Aggressivity Levels: Choose the best watermark removal strategy for your needs.
  • Image Preservation: Images and photos within the PDF are not affected.
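For the batch-processing use case mentioned above, one option is a small wrapper that calls the command-line interface once per file. The sketch below is only an illustration: the helper names build_commands and run_batch are hypothetical, and it assumes main.py accepts the -i, -o, and -m options described under Command-Line Quick Start and is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(input_dir, output_dir, mode=2):
    """Build one `main.py` invocation per PDF found in input_dir.

    Hypothetical helper: it only assembles argument lists and runs nothing.
    """
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        cleaned = Path(output_dir) / pdf.name
        commands.append(
            [sys.executable, "main.py",
             "-i", str(pdf), "-o", str(cleaned), "-m", str(mode)]
        )
    return commands


def run_batch(input_dir, output_dir, mode=2):
    """Run the commands one after another, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(input_dir, output_dir, mode):
        subprocess.run(command, check=True)
```

Calling run_batch("papers", "cleaned") would then process every PDF in a papers directory at the default level. Keeping command construction separate from execution makes it easy to inspect the commands before anything is run.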

Aggressivity Levels

  • Level 1: Removes all PDF stream resources that are explicitly identified as watermarks (e.g., using /Watermark or /Background tags).
  • Level 2 (Default): Includes all removals from Level 1, and also removes graphical elements that appear more than once across the PDF pages as well as all instances of the word 'RETRACTED'. Note: For some PDFs, this level might remove the entire text from a page.
  • Level 3: Includes all removals from Levels 1 and 2, and also removes all graphical elements from the PDF.
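To give a feel for what Level 1 targets: in a PDF content stream, watermarks are often wrapped in a marked-content section (BDC or BMC ... EMC) whose tag or property list names /Watermark or /Background. The sketch below is not this tool's implementation. It assumes the content stream has already been decoded into a string, and the function name strip_tagged_watermarks and the regular expressions are made up for this example.

```python
import re

# Start of a marked-content section: a tag, optional properties, then BDC/BMC.
# Assumption: property dictionaries are not nested.
MARK_START = re.compile(r"/\w+\s*(?:<<[^<>]*>>|/\w+)?\s*\b(?:BDC|BMC)\b")
# Any marked-content operator, used to find the matching EMC.
MC_OPERATOR = re.compile(r"\b(BDC|BMC|EMC)\b")
# Names that explicitly identify a section as a watermark or background.
WATERMARK_HINT = re.compile(r"/(?:Watermark|Background)\b")


def strip_tagged_watermarks(content: str) -> str:
    """Drop marked-content sections explicitly tagged as watermarks."""
    kept = []
    pos = 0
    while True:
        start = MARK_START.search(content, pos)
        if start is None:
            kept.append(content[pos:])
            break
        if not WATERMARK_HINT.search(start.group(0)):
            # Ordinary marked content: keep it and continue after its opener.
            kept.append(content[pos:start.end()])
            pos = start.end()
            continue
        # Walk forward to the EMC that closes this section (handles nesting).
        depth, end = 1, None
        for op in MC_OPERATOR.finditer(content, start.end()):
            depth += 1 if op.group(1) in ("BDC", "BMC") else -1
            if depth == 0:
                end = op.end()
                break
        if end is None:
            # Unbalanced stream: leave the remainder untouched.
            kept.append(content[pos:])
            break
        kept.append(content[pos:start.start()])
        pos = end
    return "".join(kept)
```

This simplified version does not parse string literals, so an operator name that appears inside displayed text would confuse it. A real implementation would work on a properly tokenized stream.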

Web Interface Quick Start

The simplest way to use the PDF Watermark Remover is through its web interface.

1. Installation

Clone the repository and install the required Python packages:

git clone https://github.com/s24hira/pdf-watermark-remover.git
cd pdf-watermark-remover
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Run the Web Server

Start the Flask application:

python main.py

Open your web browser and navigate to http://127.0.0.1:5000.

3. Usage

  1. Click the "Upload PDF File" button and select your PDF.
  2. Click "Remove Watermarks".
  3. The cleaned PDF will be automatically downloaded.

Command-Line Quick Start

1. Installation

Follow the same installation steps as for the web interface.

2. Usage

Run the watermark remover from the command line:

python main.py -i <PDF-input> -o <PDF-output> -m [mode of aggressivity]

  • <PDF-input>: Path to your input PDF.
  • <PDF-output>: Path for the cleaned output PDF.
  • [mode of aggressivity]: 1, 2, or 3 (defaults to 2).
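The options above map naturally onto Python's argparse module. The following is a minimal sketch of how such a parser could be declared, not a copy of main.py: the function name build_parser and the dest names are assumptions, and how the real script switches between CLI and web-server mode is not covered here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented -i, -o, and -m options."""
    parser = argparse.ArgumentParser(
        description="Remove watermarks from a PDF file."
    )
    parser.add_argument("-i", dest="input_pdf", required=True,
                        help="path to the input PDF")
    parser.add_argument("-o", dest="output_pdf", required=True,
                        help="path for the cleaned output PDF")
    # Restrict the level to the three documented values; 2 is the default.
    parser.add_argument("-m", dest="mode", type=int, choices=(1, 2, 3),
                        default=2, help="aggressivity level (default: 2)")
    return parser
```

With this parser, build_parser().parse_args(["-i", "in.pdf", "-o", "out.pdf"]) yields mode 2, and any value outside 1 to 3 is rejected with a usage error.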

For Developers: Frontend Setup with Tailwind CSS

The web interface is built with Flask and styled with Tailwind CSS. If you want to modify the frontend, you'll need to set up the Tailwind CSS development environment.

1. Prerequisites

Node.js and npm must be installed, since the steps below use npm to install and run Tailwind CSS.

2. Install Dependencies

Install the necessary npm packages:

npm install

This will install Tailwind CSS, PostCSS, and Autoprefixer, as defined in package.json.

3. Run the Tailwind CSS Build Process

To watch for changes in the CSS and automatically generate the output.css file, run the following command:

npm run build-css

This command, defined in package.json, uses tailwindcss to compile static/style.css into static/output.css. The --watch flag keeps the process running and automatically recompiles when you make changes to your HTML or CSS files.

4. How it Works

  • tailwind.config.js: This file configures Tailwind CSS. The content array tells Tailwind to scan all HTML and JavaScript files in the templates and static directories for class names.
  • postcss.config.js: This file configures PostCSS to use the Tailwind CSS and Autoprefixer plugins.
  • static/style.css: This is the main CSS source file. It includes the base Tailwind CSS styles.
  • static/output.css: This is the generated CSS file that is included in the main HTML template (templates/index.html). Do not edit this file directly, as it is overwritten every time the build-css script is run.

Project Structure

.
├── app/                  # Core application logic (if any)
├── main.py               # Main Flask application and CLI entry point
├── package.json          # Node.js dependencies and scripts for frontend
├── pdf_processing/       # Modules for PDF manipulation
│   ├── watermark_remover.py
│   └── ...
├── requirements.txt      # Python dependencies
├── static/               # Static assets (CSS, JS)
│   ├── style.css         # Source CSS file for Tailwind
│   └── output.css        # Generated CSS file
├── templates/            # HTML templates for Flask
│   └── index.html
├── tailwind.config.js    # Tailwind CSS configuration
└── postcss.config.js     # PostCSS configuration

About

Flask-based simple PDF watermark remover tool
