PDF Watermark Remover

Removes 'RETRACTED' watermarks from Academic PDF articles.

This software has three levels of aggressivity; as higher the level more damage it can cause to the final result.
However, even for the maximum level of aggressivity, the images/photos embedded in the PDF would still be preserved.

-- Level 1: All PDF stream resources that explicitly contain the information saying that it is a Watermark are removed.

-- Level 2 (Default): All Watermarks from level 1 and graphical element that appear more than once along the PDF pages are removed. In addition, all 'RETRACTED' words are also removed. For some few PDFs, this aggressivity level could remove the entire text from a Page.

-- Level 3: All Watermarks from levels 1 and 2 and all graphical elements are removed from the PDF. The only chance for the Retraction Watermark not to be removed with such a level of aggressivity is either the PDF is corrupted or the Retraction Watermark embedded as an Image File. In this case, we will preserve the Watermark since this function is designed not to erase any image/photo from the PDF.

Quick Start

Environment used to run the project:

Python >= 3.8
PyPDF4, PyMuPDF

There is a requirements.txt that can help with the environment.

$ pip install -f requirements.txt

Usage

Run Watermark remover

$ python PDFSolvent -i <PDF-input> -o <PDF-output> -m [mode of aggressivity]

Docker Version

You can build a docker with the Dockerfile

$ docker build -t pdf-watermark-removal .

OR

Pull the docker from Dockerhub

$ docker pull phillipecardenuto/pdf-watermark-removal
$ docker tag phillipecardenuto/pdf-watermark-removal pdf-watermark-removal

Then, run the docker by:

$ docker --rm -v <pdf-input-dirname>:/pdf/ pdf-watermark-removal -i <pdf-input-basename> -o <pdf-output-basename> -m <mode>

After running this command, you should see the output PDF <pdf-output-basename> in the same directory of the input PDF.

Example:

docker run -it --rm -v $(realpath test):/pdf/ pdf-watermark-removal -i 10.1371_journal.pone.0003856.pdf -o output.pdf -m 1

Cases of Success

1: The Watermark is explicit informed in each Page-Resource-XObject field:

Using RUPS from IText©

2: The Watermark is a word that overlays the PDF body Text:

SOURCE : https://doi.org/10.1016/j.vetmic.2019.01.004

% Content Stream of PDF above
% the RETRACTED mark using text operations 
BT
/TT0 1 Tf
0 Tc
0 Tw
0 Ts
100 Tz
0 Tr
72 0 0 72 0 -72 Tm
(RETRACTED) Tj
ET

3. The Watermark is generated by Graphic Path Construction

SOURCE : https://doi.org/10.1002/hep.26505

% Content Stream of PDF above
% There isn't any Xobject related to the watermark, it is written by a set of graphical PDF istructions
stream
/Figure BMC
q
1 0 0 1 22.5352 -66.9826 cm
0 0 595.276 841.89 re
W
n
/GS4 gs
1 0 0 1 80.9177 327.6761 cm
0 0 0 0.25 k
0 0 m
-1.326 1.526 -2.462 2.332 -3.406 2.418 c
....
h
f
EMC
Q
endstream

Case of Failure

A case of failure occurs when the PDF is corrupted, and the 'RETRACTED' word is split into subwords along the PDF stream.

For this case, we recommend to download another version of the article and try again.

SOURCE : https://doi.org/10.1016/S0022-2275(20)30277-7

% Content Stream of PDF above
% The PDF is not well organized, and the watermark is not explicit inform in the Resources
% The following stack of instruction are outside of a 'q' 'Q' sequence
% The 'RETRACTED' word is splited in many subwords (RET),(RA),(CTED)

stream
BT
/CS1 cs
1 0 0 scn
1 i
/GS3 gs
/T1_6 1 Tf
61.0339 28.6974 -28.698 61.035 131.1588 287.8868 Tm
(RET) Tj
1.9802 -0.0003 Td
(RA) Tj
1.4071 -0.0003 Td
(CTED) Tj
-3.3847 0.0003 Td
(RET) Tj
1.9806 0.0004 Td
(RA) Tj
1.4071 -0.0003 Td
(CTED) Tj
ET
endstream

AUTHOR

João Phillipe Cardenuto,

				 UNICAMP (University of Campinas) RECOD Lab

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.doc-figs		.doc-figs
.figs		.figs
PDFSolvent		PDFSolvent
test		test
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
documentation.md		documentation.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PDF Watermark Remover

Quick Start

Usage

Docker Version

Cases of Success

1: The Watermark is explicit informed in each Page-Resource-XObject field:

2: The Watermark is a word that overlays the PDF body Text:

3. The Watermark is generated by Graphic Path Construction

Case of Failure

AUTHOR

About

Uh oh!

Releases

Packages

Languages

phillipecardenuto/watermark-removal

Folders and files

Latest commit

History

Repository files navigation

PDF Watermark Remover

Quick Start

Usage

Docker Version

Cases of Success

1: The Watermark is explicit informed in each Page-Resource-XObject field:

2: The Watermark is a word that overlays the PDF body Text:

3. The Watermark is generated by Graphic Path Construction

Case of Failure

AUTHOR

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages