
Instructions

Spinning Up the Docker Container

First, you will need to set up your .envs/ folder. Create an .envs/ directory in the project's base directory, and within it create another directory: .local/. Inside the .local/ directory, create two files: .django and .postgres. In the .django file, place the following:

# General
# ------------------------------------------------------------------------------
USE_DOCKER=yes
IPYTHONDIR=/app/.ipython

# Redis
# ------------------------------------------------------------------------------
REDIS_URL=redis://redis:6379/0

# Celery
# ------------------------------------------------------------------------------

# Flower
CELERY_FLOWER_USER=debug
CELERY_FLOWER_PASSWORD=debug

TOR_CONTROLLER_PASSWORD=debug

DJANGO_ACCOUNT_ALLOW_REGISTRATION=False

and inside of the .postgres file place this:

# PostgreSQL
# ------------------------------------------------------------------------------
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DB=openexploit
POSTGRES_USER=debug
POSTGRES_PASSWORD=debug

The overall structure should look like this:

.envs/
└── .local
   ├── .django
   └── .postgres

Now that you have these files in place, you can build the container with:

docker-compose -f local.yml build

This may take some time.

Once the container is built, you can bring it up with:

docker-compose -f local.yml up

And bring it down with:

docker-compose -f local.yml down

ML Pipeline File Placement

The ML pipeline script requires several files to run. Here's what they are and where to place them.
First, in the base directory of the project, create a folder called models; inside models, create two more directories called model and tokenizer. Then place the saved model folder (the folder containing the config.json and model.safetensors files) inside the models/model/ directory.
Place the saved tokenizer folder (the folder containing special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.txt) inside the models/tokenizer/ directory.
Lastly, the ML pipeline script requires an xpath_queries JSON file. Simply place this file inside the models/ directory.
The structure should look like this:

models/
├── model
│   └── model_V1
│       ├── config.json
│       ├── model.safetensors
├── tokenizer
│   └── tokenizer_V1
│       ├── special_tokens_map.json
│       ├── tokenizer.json
│       ├── tokenizer_config.json
│       ├── vocab.txt
├── xpath_queries_2023-07-03_1.json

DO NOTE that the model.safetensors file is too large for GitHub, so the models/ directory is ignored by git. This means the model, tokenizer, and xpath queries files must be changed manually. If you would like to see an example of this file structure, please see production's /var/docker/openexploitdatabasescraper/models/
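
For reference, the saved model and tokenizer folders are in the standard Hugging Face transformers on-disk format (config.json, model.safetensors, tokenizer files). Below is a minimal sketch of how such artifacts can be loaded; the versioned folder names follow the example tree above, and the use of AutoModel/AutoTokenizer is an assumption, not the pipeline's actual loading code.

# A minimal loading sketch, assuming the Hugging Face transformers library;
# the folder names follow the example tree above and may differ in your checkout.
import json

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("models/model/model_V1")
tokenizer = AutoTokenizer.from_pretrained("models/tokenizer/tokenizer_V1")

# The xpath queries file is plain JSON and can be read directly.
with open("models/xpath_queries_2023-07-03_1.json") as f:
    xpath_queries = json.load(f)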

High Level Overview of Scrapers

The scrapers are made using Scrapy. They take an initial URL and from there crawl through the website's pagination of exploits. On each page, the scraper attempts to gather all of the exploits and their data, saving the info into the Exploit Django model. Each scraper has two modes: an update mode and a full run mode. The update mode is used to update the database with a website's newer entries only, while the full run mode crawls all the exploits the website has to offer; this is done to initially populate the database with a website's exploit archive. To avoid detection, whenever a Scrapy spider loads the next page it does so using Tor via a custom downloader middleware. Tor is also used when requesting an exploit's example file during a full run, but not during an update run, as Tor is not needed then.
More details on each scraper can be found in the spiders' README.
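
Here is a minimal sketch of the crawling pattern described above; it is not one of the real spiders. The spider name, start URL, CSS selectors, and the n_days argument are illustrative, and the real spiders hand their scraped fields to the save_exploit helper described under "Integrating a New Spider" below.

# A minimal sketch of the pagination pattern described above, not one of the
# actual spiders; the start URL and selectors are placeholders.
import scrapy


class ExampleExploitSpider(scrapy.Spider):
    name = "example_exploits"
    start_urls = ["https://example.com/exploits?page=1"]  # hypothetical source

    def __init__(self, n_days=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Update mode: only gather entries newer than n_days; full run otherwise.
        self.n_days = int(n_days) if n_days is not None else None

    def parse(self, response):
        # Gather every exploit listed on the current page.
        for row in response.css("tr.exploit"):
            yield {
                "name": row.css("a::text").get(),
                "source_url": response.urljoin(row.css("a::attr(href)").get()),
            }
        # Follow the pagination; in the real spiders this request is routed
        # through Tor by the custom downloader middleware.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)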

Exploit Model Explanation

To store an exploit and its data, a Django model is used. Aptly named "Exploit", this model can store several attributes of a scraped exploit. Here are all of the model's fields and their descriptions:

  • source: The source name of this exploit. Ex.: GitHub, CXSecurity, ...
  • source_url: The URL of the source.
  • name: The name of the exploit.
  • cve_id: The CVE ID of the exploit.
  • is_repo: Boolean field indicating if the exploit is a repo archive file.
  • date_published: The datetime object of when the exploit was published or uploaded, defaults to None.
  • date_created: The datetime object of when the exploit model was created, autofills when created.
  • author: The author or uploader of the exploit, defaults to None.
  • description: A short description of the exploit, defaults to None.
  • download_failed: Boolean value indicating if a failure to download the example exploit file occurred, defaults to False.
  • example_file: The exploit's example or demonstration file.
  • ignore: A flag indicating whether or not to ignore this exploit when displaying errored exploits in the admin panel.
  • fixed: A flag indicating if this exploit has been fixed or not.

The following model attributes were put in place mainly for the ML pipeline output.

  • vendor: The vendor of the exploit.
  • product: Product information.
  • vul_type: The vulnerability type.
  • risk: The risk level.
  • pub_dates: The publication dates of the exploit.
  • version: The versioning of the exploit.
  • remote_local: Remote/Local information on the exploit.
  • host_info: The host information.
  • poc: The proof of concept.
  • reproduce: Steps to reproduce.
  • impact: The impact of the exploit.

At a minimum each exploit model will have the source, name, is_repo, and date_created filled out. The other attributes are scraped when possible or have a default value.
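
As a rough illustration, a few of the fields above might be declared as follows; this is a sketch, not the project's actual model definition, and the field types, options, and defaults are assumptions.

# A minimal sketch of a handful of the fields listed above; the real Exploit
# model may use different field types, options, and defaults.
from django.db import models


class Exploit(models.Model):
    source = models.CharField(max_length=255)               # e.g. "GitHub", "CXSecurity"
    source_url = models.URLField(null=True, blank=True)
    name = models.CharField(max_length=255)
    cve_id = models.CharField(max_length=64, null=True, blank=True)
    is_repo = models.BooleanField(default=False)
    date_published = models.DateTimeField(null=True, default=None)
    date_created = models.DateTimeField(auto_now_add=True)  # autofills on creation
    download_failed = models.BooleanField(default=False)
    example_file = models.FileField(upload_to="exploits/", null=True, blank=True)  # upload path is an assumption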

Please know that all downloaded exploit examples on the production server can be found in the /var/media/ directory.

Crawl Command Instructions

The ability to run all of the scrapers is provided via the crawl command. This custom Django command allows you to control which scrapers to run (or, more accurately, which scrapers not to run) and whether to do an update run or a full scrape run. A full synopsis of the crawl command's arguments can be viewed by typing ./manage.py crawl --help or python manage.py crawl --help. NOTE: As this is a custom Django command, it must be run through manage.py, which is why ./manage.py precedes the crawl command as seen above.

Currently, there are six scrapers that can be run: the ExploitDB spider, the CXSecurity spider, the Packetstorm spider, the Repo spider, the Metasploit scraper, and the NVD spider.

Below are some example use cases of the crawl command:

  1. Let's say you want to run an update of the CXSecurity exploits, going back 2 days. Here's what the command would look like:
    ./manage.py crawl 2 --exploitdb --packetstorm --repo or ./manage.py crawl 2 -e -p -r
    Remember, the scraper flags (-e, -p, and -r) flag those scrapers NOT to run, since most of the time you'll want to update using all scrapers.
    Speaking of updating all of the sources...
  2. If you want to update all of the sources going back 5 days, you would run:
    ./manage.py crawl 5. If you wanted to scrape the entirety of all sources, you would simply run ./manage.py crawl. By not specifying the max number of days to go back, the scrapers will scrape everything.
  3. Some scrapers (such as CXSecurity) have the option to specify a start and end page for their source's exploit pagination: ./manage.py crawl -s 10 -l 20. This command would result in CXSecurity and Packetstorm scraping pages 10-20; however, the ExploitDB and Repo spiders do not make use of these parameters and would do a full run, as n_days is not provided.
    In this case, it would be good to flag these two scrapers for no run: ./manage.py crawl -s 10 -l 20 -e -r
    Now, only Packetstorm and CXSecurity will run, using the pagination parameters handed to them.

Integrating a New Spider

Below are instructions on how to integrate a new Scrapy spider into the project. But first, it would be a good idea to familiarize yourself with how a Scrapy spider works. It would also be a good idea to look into and reuse logic from the current spiders, such as the update delta, the Tor request for a full run, etc.
Now, to integrate a new spider into the project:

  1. Start by adding your new spider to the same directory where all the other spiders are located.
  2. Once your spider is able to scrape all of the desired info, use the save_exploit helper function to save the scraped exploit info to the database as an Exploit model. This function takes in the various attributes of the Exploit model, assembles them into an Exploit model, and saves it to the database for you.
  3. Lastly, in order for the crawl command to run your new spider, you'll need to add it to the command's script (see the sketch after this list).
    First, import your spider.
    Then, add a no-run flag for your scraper in add_arguments().
    And lastly, add your spider to handle() just like the other scrapers.
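
Here is a minimal sketch of the step 3 changes, assuming the crawl command follows the usual Django management command layout; the spider class, import path, and the "-x/--example" no-run flag are illustrative, not the project's actual code.

# A sketch of wiring a new spider into the crawl command; the import path,
# spider class, and no-run flag names are hypothetical.
from django.core.management.base import BaseCommand
from scrapy.crawler import CrawlerProcess

from openexploit.scrapers.spiders.example_spider import ExampleExploitSpider  # hypothetical path


class Command(BaseCommand):
    help = "Run the exploit scrapers"

    def add_arguments(self, parser):
        # No-run flag: passing it EXCLUDES the spider from the run.
        parser.add_argument("-x", "--example", action="store_true",
                            help="Do NOT run the Example spider")

    def handle(self, *args, **options):
        process = CrawlerProcess()
        if not options["example"]:
            process.crawl(ExampleExploitSpider)
        process.start()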

Creating and Exporting a Database Snapshot

You'll need SSH access to the production server. First, SSH into: openexploit.crc.nd.edu

Change directory to where the code is deployed: /opt/docker/openexploitdatabasescraper

Run the following command to create a new backup of the database:

docker compose -f production.yml exec postgres backup

This will create a new backup (let's call it backup_2023_10_01T12_00_00.sql.gz) in the container's /backups directory.

Next, copy the new file from the docker container to your home directory:

docker cp $(docker compose -f production.yml ps -q postgres):/backups/backup_2023_10_01T12_00_00.sql.gz ~/

Once this is in your home directory, you can secure-copy it to your local machine (in a new terminal):

scp [USERNAME]@openexploit.crc.nd.edu:/home/[USERNAME]/backup_2023_10_01T12_00_00.sql.gz .

Finally, you can copy this to your local docker location and restore it.

NOTE: You'll need to make sure you only bring up the postgres service. If Django has an open connection to the database, the restore could fail!

Bring up only the postgres service with:

docker compose -f local.yml up postgres

Then copy the exported data into your local postgres's /backups folder with:

docker cp backup_2023_10_01T12_00_00.sql.gz $(docker compose -f local.yml ps -q postgres):/backups

And finally, restore it:

docker compose -f local.yml exec postgres restore backup_2023_10_01T12_00_00.sql.gz

Accessing and Using the Exploit's API

There are two ways of accessing the API: through the browser or by requesting an authentication token and accessing the endpoint using that token.

Accessing API through the browser

In order to access the Exploit's API endpoint via the browser, you just have to be logged in. Once logged in, simply head to the /api/exploits/ endpoint.

Accessing the API Through a Token

If you're wanting to access the API through a script or through an API platform such as Postman, this is the way to do it.
First, you need to get an authentication token, to do this you must have an account. To get an authentication token you'll need to make a POST request to /api/authenticate/ with a data payload containing your login credentials in json format. E.g.,

{
    "username": "<username_here>",
    "password": "<password_here>"
}

Provided that your credentials are valid, the response's content will contain your token formatted as:

{
    "token": "<token_key_here>"
}

Using your token, you can now set your authorization as follows:

{
    "Authorization": "Token <token_key_here>"
}

Once set, you will be able to access the /api/exploits/ endpoint.
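
Here is a minimal sketch of that flow from a script, using the requests library; the base URL is a placeholder for wherever the service is hosted.

# A sketch of the token flow described above; BASE_URL is a placeholder.
import requests

BASE_URL = "https://example.com"

# 1. Exchange your credentials for a token.
resp = requests.post(
    f"{BASE_URL}/api/authenticate/",
    json={"username": "<username_here>", "password": "<password_here>"},
)
token = resp.json()["token"]

# 2. Use the token in the Authorization header to query the exploits endpoint.
exploits = requests.get(
    f"{BASE_URL}/api/exploits/",
    headers={"Authorization": f"Token {token}"},
)
print(exploits.json())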

Exploit Endpoint Parameters

There are three parameters you can make use of: page, page_size, and cve_id.
By default, page=1 and page_size=100.
If you are accessing the endpoint through a browser, you'll simply add these onto the URL. Here are a few examples:
Let's say you wanted to access page three of the exploits where each page contains 10 entries:

  • /api/exploits/?page=3&page_size=10

Now let's say you only want exploits whose CVE ID contains the number 2; you would set the cve_id param to 2:

  • /api/exploits/?cve_id=2

If you are accessing the endpoint through a script, you'll add your parameters into a JSON-formatted data payload. Here are a couple of examples:
Here's the equivalent to the first example seen above:

{
    "page": 3,
    "page_size": 10
}

And here's the equivalent to the second example seen above:

{
    "cve_id": 2
}
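
For example, here are both queries from a script using the requests library; this sketch passes the values as query-string parameters, mirroring the browser examples above, rather than as a request body.

# A sketch of the two examples above from a script; BASE_URL and the token
# header are placeholders, as in the earlier authentication sketch.
import requests

BASE_URL = "https://example.com"
HEADERS = {"Authorization": "Token <token_key_here>"}

# Page three with 10 entries per page.
page_three = requests.get(f"{BASE_URL}/api/exploits/",
                          headers=HEADERS, params={"page": 3, "page_size": 10})

# Only exploits whose CVE ID contains the number 2.
cve_filtered = requests.get(f"{BASE_URL}/api/exploits/",
                            headers=HEADERS, params={"cve_id": 2})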

Production Nightly Runs

Nightly pulls are set up as a cron job scheduled under the production root user's crontab. These can be edited by SSH'ing into the production server, elevating to root (with sudo su), and opening the crontab:

crontab -e

You'll find the only entry for running nightly scrapes defined as:

0 0 * * * cd /opt/docker/openexploitdatabasescraper/ && docker compose -f production.yml run --rm django python manage.py crawl --all -n 2 > /var/log/crawl.log 2>&1

This line can be broken down into the following instructions:

  • Launch at midnight every night
  • Change directory to where the code is deployed (/opt/docker/openexploitdatabasescraper/)
  • Run the docker compose command to launch a new Django instance
  • Execute the python manage.py crawl Django command within that container
  • Save all records, regardless of their CVE-ID status (--all)
  • Ignore the NVD scraper (-n). NOTE: at the time of this writing, the model was still buggy, causing this crawler to crash. Removing the -n flag from this command will enable the NVD scraper during nightly scrapes.
  • Look for vulnerabilities published within the past 2 days only (2); this allows us to get only recently published vulnerabilities, keeping the run time of the crawlers to a minimum
  • Dump any output to /var/log/crawl.log on the host (not inside the Docker container), including any errors (2>&1)

Disabled NVD

As mentioned above, at the time of this writing, the NVD scraper was still buggy due to model calling errors. As a result, this spider is disabled on the nightly runs. To re-enable this spider, remove the -n flag in the cron job.
