Harshith1201/malware-threat-intelligence-scrapper

Malware Threat Intelligence Scraper


An open-source tool that scrapes malware, vulnerability, and phishing data from Reddit, BleepingComputer, X (Twitter), CISA, Pastebin, and PhishTank. The collected data is stored in a SQLite database for use by ethical hackers and security researchers.

Note: This tool is intended for ethical hacking and security research only. Please respect the terms of service of the data sources and privacy laws.

Features

  • Multi-source Scraping: Collects data from Reddit, BleepingComputer, X, CISA, Pastebin, and PhishTank
  • Automated Data Collection: Scheduled scraping to keep the database up to date
  • Command-line Interface: Easy access to the collected data
  • Web Dashboard: Visual representation of the collected data
  • API: Programmatic access to the collected data
  • Community Features: Validation of collected data

What It Collects

  • IP Addresses: Potentially malicious IP addresses
  • Hashes: MD5, SHA-256, and other hashes of malware samples
  • CVEs: Common Vulnerabilities and Exposures identifiers
  • URLs: Malicious and phishing URLs
  • TTPs: Tactics, Techniques, and Procedures used by threat actors
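To make the indicator types above concrete, here is a minimal sketch of how such values could be pulled out of free text with regular expressions. The pattern table and the `extract_iocs` function are hypothetical names chosen for illustration; the project's own extraction logic is not shown in this README and may work differently. TTPs are left out because they generally need more than pattern matching.

```python
import re

# Hypothetical patterns for illustration; the project's real rules may differ.
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "url": re.compile(r"\bhttps?://[^\s<>\"']+"),
}


def extract_iocs(text):
    """Return a dict mapping each IOC type to a sorted list of unique matches."""
    found = {}
    for ioc_type, pattern in IOC_PATTERNS.items():
        matches = {match.group(0) for match in pattern.finditer(text)}
        if matches:
            found[ioc_type] = sorted(matches)
    return found


if __name__ == "__main__":
    sample = (
        "Dropper 44d88612fea8a8f36de82e1278abb02f contacted 203.0.113.7 "
        "and exploited CVE-2021-44228 via http://evil.example/payload today"
    )
    print(extract_iocs(sample))
```

Patterns this simple produce false positives (any 32-character hex string looks like an MD5), which is one reason a validation step for collected IOCs is useful.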

Quick Start

For a quick demonstration of all features:

python demo.py

This will:

  1. Initialize the database with sample data
  2. Demonstrate the CLI features
  3. Start the web dashboard and API
  4. Open the web dashboard and API documentation in your browser
  5. Demonstrate the validation feature
  6. Run a sample spider

Setup

Prerequisites

  • Python 3.9+
  • Git
  • Chrome/Chromium (for the X spider which uses Selenium)

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/malware-scraper.git
    cd malware-scraper
    
  2. Create a virtual environment:

    python -m venv venv
    
  3. Activate the virtual environment:

    • Windows: venv\Scripts\activate
    • Linux/macOS: source venv/bin/activate
  4. Run the installation script:

    python install.py
    

    This will install all required dependencies and download the spaCy model.

  5. Set up Reddit API credentials:

    • Go to https://www.reddit.com/prefs/apps and create a new app
    • Select "script" as the app type
    • Create a .env file with the following content:
      REDDIT_CLIENT_ID=your_client_id
      REDDIT_CLIENT_SECRET=your_client_secret
      REDDIT_USER_AGENT=malware-scraper/1.0
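
If you want to confirm that the three variables above are readable before running the Reddit spider, a standard-library check along these lines may help. This is only a sketch: `parse_env_text` and `load_reddit_settings` are hypothetical helpers, and the project may well load the file with a dedicated library instead.

```python
import os

REQUIRED_KEYS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_reddit_settings(env_path=".env"):
    """Combine .env values with real environment variables and check for gaps.

    Real environment variables take precedence over values in the file.
    """
    file_values = {}
    if os.path.exists(env_path):
        with open(env_path, encoding="utf-8") as handle:
            file_values = parse_env_text(handle.read())
    settings = {
        key: os.environ.get(key, file_values.get(key, "")) for key in REQUIRED_KEYS
    }
    missing = [key for key, value in settings.items() if not value]
    if missing:
        raise RuntimeError("Missing Reddit settings: " + ", ".join(missing))
    return settings
```

Calling `load_reddit_settings()` from the project root raises a `RuntimeError` naming any variable that is still unset.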
      

Usage

Demo Scripts

The project includes several demo scripts to help you get started:

  • demo.py: Demonstrates all features
  • demo_cli.py: Demonstrates the CLI
  • demo_dashboard.py: Demonstrates the web dashboard
  • demo_api.py: Demonstrates the API
  • demo_validate.py: Demonstrates the validation feature
  • demo_scheduler.py: Demonstrates the scheduler

Running the Scrapers

You can run individual spiders:

cd scraper
scrapy crawl reddit_spider
scrapy crawl bleeping_spider
scrapy crawl x_spider
scrapy crawl cisa_spider
scrapy crawl pastebin_spider
scrapy crawl phishtank_spider

Or run all spiders at once:

python run_spiders.py

Or run all spiders automatically using the scheduler:

python scheduler.py
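
For readers curious what "run all spiders on a schedule" can look like, the following is a rough sketch and not the contents of `run_spiders.py` or `scheduler.py`, which are not shown here. It shells out to the same `scrapy crawl` commands listed above; the function names, the one-hour interval, and the injectable `runner` parameter are assumptions made for illustration.

```python
import subprocess
import time

# Spider names as listed in this README.
SPIDERS = [
    "reddit_spider",
    "bleeping_spider",
    "x_spider",
    "cisa_spider",
    "pastebin_spider",
    "phishtank_spider",
]


def run_all_spiders(runner=subprocess.run, cwd="scraper"):
    """Run each spider in turn and return the names of those that failed."""
    failed = []
    for name in SPIDERS:
        result = runner(["scrapy", "crawl", name], cwd=cwd)
        if result.returncode != 0:
            failed.append(name)
    return failed


def run_forever(interval_seconds=3600):
    """Repeat the full crawl at a fixed interval (hypothetical default: 1 hour)."""
    while True:
        failed = run_all_spiders()
        if failed:
            print("Failed spiders:", ", ".join(failed))
        time.sleep(interval_seconds)
```

Running the spiders one after another keeps the load on each source low, and a failure in one spider does not stop the others.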

Using the CLI

List all IOCs in the database:

python -m cli.app list_iocs

Search for IOCs by name:

python -m cli.app search --name Emotet

Export all IOCs to a CSV file:

python -m cli.app export_csv --output data/my_iocs.csv

Running the Dashboard

Start the web dashboard:

python app.py

Then open http://localhost:5000 in your browser.

Features:

  • View all IOCs in a table
  • Filter by source and IOC type
  • Search by name
  • Validate or invalidate IOCs
  • View source URLs

Using the API

Start the API server:

uvicorn api:app --reload

API endpoints:

  • GET /iocs: List all IOCs (with optional filtering)
  • GET /iocs/{ioc_id}: Get a specific IOC by ID
  • GET /search?name=...: Search IOCs by name

Interactive API documentation is available at http://localhost:8000/docs.
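
As an illustration of calling the endpoints listed above from Python, here is a small client sketch that uses only the standard library. The helper names are hypothetical, the base URL assumes the default port shown in the documentation link, and the JSON shape of the responses is not specified in this README, so the sketch returns whatever the server sends without interpreting it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumed from the docs URL above


def build_url(path, **params):
    """Join the base URL, a path, and any non-empty query parameters."""
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def search_iocs(name):
    """Call GET /search?name=... and return the decoded response."""
    return fetch_json(build_url("/search", name=name))


def get_ioc(ioc_id):
    """Call GET /iocs/{ioc_id} and return the decoded response."""
    return fetch_json(build_url(f"/iocs/{int(ioc_id)}"))
```

With the API server running, `search_iocs("Emotet")` issues the same request as opening `/search?name=Emotet` in a browser.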

Validating IOCs

You can validate or invalidate IOCs using the validation script:

python validate.py --id 1 --valid True
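
Conceptually, validating an IOC amounts to updating a flag on a row in the SQLite database. The sketch below shows that idea against an in-memory database. The table name `iocs`, its columns, and the `set_validation` function are assumptions made for illustration and are not taken from the project's actual schema.

```python
import sqlite3


def create_demo_db():
    """Build an in-memory database with a hypothetical iocs table and one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE iocs ("
        "id INTEGER PRIMARY KEY, name TEXT, ioc_type TEXT, value TEXT, "
        "validated INTEGER)"
    )
    conn.execute(
        "INSERT INTO iocs (name, ioc_type, value, validated) VALUES (?, ?, ?, NULL)",
        ("Emotet", "ip", "203.0.113.7"),
    )
    conn.commit()
    return conn


def set_validation(conn, ioc_id, valid):
    """Mark an IOC as valid (1) or invalid (0); return True if a row changed."""
    cursor = conn.execute(
        "UPDATE iocs SET validated = ? WHERE id = ?",
        (1 if valid else 0, ioc_id),
    )
    conn.commit()
    return cursor.rowcount == 1


if __name__ == "__main__":
    demo = create_demo_db()
    print(set_validation(demo, 1, True))
```

Keeping `validated` as NULL until someone reviews the entry distinguishes unreviewed IOCs from those explicitly marked invalid.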

Running the Entire System

To run the entire system (dashboard and API):

python run.py

This will start the web dashboard and API, and open them in your browser.

Deployment Options

Local Deployment

For personal use or testing:

# Run the dashboard and API
python run.py

AWS EC2 Deployment

  1. Launch an EC2 instance (t2.micro for free tier)
  2. SSH into the instance:
    ssh -i key.pem ubuntu@ec2-ip
  3. Install dependencies:
    sudo apt update && sudo apt install python3 python3-pip git
    git clone https://github.com/yourusername/malware-scraper.git
    cd malware-scraper
    pip install -r requirements.txt
  4. Run in the background (on Ubuntu the interpreter is installed as python3):
    nohup python3 run.py &

Docker Deployment

  1. Build the Docker image:
    docker build -t malware-scraper .
  2. Run the container:
    docker run -p 5000:5000 -p 8000:8000 malware-scraper

See DEPLOYMENT.md for detailed deployment instructions.

Data Sources

  • Reddit: r/Malware, r/netsec, r/cybersecurity, r/hacking
  • BleepingComputer: Security news articles
  • X (Twitter): Posts tagged #malware
  • CISA: Known Exploited Vulnerabilities Catalog
  • Pastebin: Recent public pastes
  • PhishTank: Recent phishing URLs

Ethical Use

This tool is intended for ethical hacking and security research only. Please respect the terms of service of the data sources and privacy laws. Do not use the collected data for malicious purposes.

Legal Considerations

  • Comply with the terms of service of each data source
  • Respect rate limits and robots.txt
  • Comply with data protection regulations (GDPR, CCPA, etc.)
  • Use the data for defensive security purposes only
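
For custom scripts built on top of this tool, the points about robots.txt and rate limits can be checked with the standard library, as in the sketch below. The `build_robot_rules` helper and the `RateLimiter` class are hypothetical. Scrapy also provides the `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY` settings for the same purpose; whether this project enables them is not stated in this README.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_call = None

    def wait(self):
        """Sleep just long enough to honor the minimum interval."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


if __name__ == "__main__":
    rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")
    print(rules.can_fetch("malware-scraper/1.0", "https://example.com/private/x"))
```

A script would call `rules.can_fetch(user_agent, url)` before each request and `limiter.wait()` between requests to the same host.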

License

MIT

Contributing

Please see CONTRIBUTING.md for details on how to contribute to this project.
