PDF Watson is a minimalist Python tool that scans PDF files for malicious code. It extracts metadata from the file and searches for common patterns associated with malicious code, potentially harmful JavaScript, and dangerous embedded files.
This is a proof of concept. Use it with caution, at your own risk, and always in controlled environments. The software comes with no warranty, and the author accepts no liability for its use.
- PDF Watson - Security Inspector for PDFs
- Important Note
- Features
- Requirements
- Installation
- Basic Usage
- Results
- Common Issues
- License
- Metadata Extraction: Obtains information about the author, title, creation date, etc.
- Malicious Code Search: Detects common patterns in JavaScript and other elements that may indicate malicious code.
- Detection of Dangerous Embedded Files: Identifies embedded files within the PDF with potentially dangerous extensions.
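To illustrate the kind of pattern search described above, here is a minimal sketch that counts suspicious PDF names in the raw bytes of a file. The keyword list and the function name `count_suspicious_names` are assumptions made for illustration, loosely following the name-counting idea used by tools such as pdfid; the patterns PDF Watson actually uses may differ.

```python
import re

# Hypothetical keyword list for illustration only; not PDF Watson's real patterns.
SUSPICIOUS_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/AA",
    b"/Launch",
    b"/EmbeddedFile",
)


def count_suspicious_names(data: bytes) -> dict[str, int]:
    """Count how often each suspicious PDF name appears in raw PDF bytes."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # Require that the name is not followed by a letter or digit,
        # so that /AA does not also match a longer name such as /AAbc.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

A caller could read a file with `Path(path).read_bytes()` and pass the result to the function. A scan like this is only a first pass: it does not see names hidden inside compressed streams or written with hex escapes such as `/J#53`.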
For more details, see documentation.en.md
- Python 3.9 or higher
- Libraries:
  - PyPDF2
  - magic
  - Tkinter
You can install the necessary libraries with the following command:
pip install -r requirements.txt
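After installing, it can be useful to confirm that the listed libraries are importable before launching the tool. The sketch below is an assumption about how one might check this, not part of PDF Watson; `missing_modules` is a hypothetical helper, and the names in `REQUIRED_MODULES` are the import names of the libraries listed above (`tkinter` is the Python 3 module name for Tkinter).

```python
import importlib.util

# Import names of the libraries listed in the requirements.
REQUIRED_MODULES = ("PyPDF2", "magic", "tkinter")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` with no arguments returns an empty list when all three modules can be found, and otherwise the names that still need to be installed.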
Clone this repository, then create and activate a virtual environment and install the dependencies:
git clone https://github.com/tu_usuario/PDF-Watson.git
cd PDF-Watson
python -m venv watson_env
watson_env\Scripts\activate # On Windows
# For macOS/Linux use: source watson_env/bin/activate
pip install -r requirements.txt
- Run the Script:
python PDF-Watson.py
- Graphical User Interface: When you run the script, a window will open where you can select a PDF file or a directory to perform an inspection.
To select a file, click on "Scan PDF". If you want to inspect a directory containing PDFs for batch inspections, select "Scan Directory".
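For batch inspection, the tool needs a list of the PDFs in the chosen directory. The following is a minimal sketch of how such a list could be gathered; `find_pdfs` and its `recursive` parameter are hypothetical, and whether PDF Watson descends into subdirectories is not stated in this README.

```python
from pathlib import Path


def find_pdfs(directory, recursive=True):
    """Return the PDF files under a directory, matching the extension case-insensitively."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```

Matching on the lowercased suffix means files named `REPORT.PDF` are picked up as well. Each returned path could then be passed to the same routine that inspects a single file.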
Below the buttons, four tabs show the results:
- Summary: Displays a summary of the inspection.
- Metadata: Shows the metadata.
- Security Analysis: Summarizes relevant information about any JavaScript code found.
- Log: Records the inspection history, operations, and execution errors stored in pdf_watson.log.
After performing an inspection, you can export the results to a .txt file.
- File Metadata: The extracted metadata from the PDF file will be displayed, such as author, title, creation date, etc.
- Malicious Code Alerts: If malicious patterns are detected in the JavaScript content, corresponding alerts will be shown.
- Dangerous Embedded Files: Embedded files within the PDF that may be dangerous will be identified, and alerts will be generated if necessary.
- Log: Documents the inspections performed, operations, and execution errors.
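The embedded-file alert described above can be pictured as an extension check on the names of the attached files. This is a sketch under stated assumptions: the extension set and the helper `flag_dangerous` are illustrative, and the list PDF Watson actually uses is not given in this README.

```python
from pathlib import PurePath

# Hypothetical set of extensions treated as dangerous, for illustration only.
DANGEROUS_EXTENSIONS = {".exe", ".bat", ".cmd", ".js", ".vbs", ".ps1", ".scr", ".jar"}


def flag_dangerous(filenames):
    """Return the file names whose final extension is in the dangerous set."""
    return [
        name for name in filenames
        if PurePath(name).suffix.lower() in DANGEROUS_EXTENSIONS
    ]
```

Because only the final extension is compared, a name such as `report.pdf.exe` is flagged as well.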
If the PDF is about programming, the tool may flag code that appears as plain text in the document as malicious (a false positive).
If the file does not open correctly, make sure that:
- The file path is correct.
- You have permission to read the file.
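Both checks can also be done from a Python prompt. The helper below, `check_readable_pdf`, is a hypothetical sketch and not part of PDF Watson; it returns a short description of the problem, or `None` when the path looks usable.

```python
import os


def check_readable_pdf(path):
    """Return a description of why a file cannot be opened, or None if it looks fine."""
    if not os.path.exists(path):
        return "path does not exist"
    if not os.path.isfile(path):
        return "path is not a file"
    if not os.access(path, os.R_OK):
        return "no read permission"
    return None
```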
If you encounter errors while running the script, check the pdf_watson.log file for more detailed information about the error.
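To look at the most recent entries without opening the whole log, a small helper like the one below can be used. It is a sketch that assumes the log is a plain UTF-8 text file named pdf_watson.log in the current working directory; `tail_log` is a hypothetical name.

```python
from collections import deque


def tail_log(path="pdf_watson.log", lines=20):
    """Return the last `lines` lines of a text log file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        # A bounded deque keeps only the most recent lines while reading.
        return [line.rstrip("\n") for line in deque(fh, maxlen=lines)]
```

For example, `print("\n".join(tail_log(lines=10)))` shows the ten latest entries.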
- Improve the GUI
- Add the ability to ignore plain text to avoid false positives
- Contextual JavaScript code viewer
- Incorporate techniques learned from https://blog.didierstevens.com/
This project is licensed under the GNU General Public License.