
Instructions

Spinning Up the Docker Container

First, you will need to set up your .envs/ folder. Create an .envs/ directory in the project's base directory, and within it create another directory: .local/. Inside the .local/ directory, create two files: .django and .postgres. In the .django file, place the following:

# General
# ------------------------------------------------------------------------------
USE_DOCKER=yes
IPYTHONDIR=/app/.ipython

# Redis
# ------------------------------------------------------------------------------
REDIS_URL=redis://redis:6379/0

# Celery
# ------------------------------------------------------------------------------

# Flower
CELERY_FLOWER_USER=debug
CELERY_FLOWER_PASSWORD=debug

TOR_CONTROLLER_PASSWORD=debug

DJANGO_ACCOUNT_ALLOW_REGISTRATION=False

and inside of the .postgres file place this:

# PostgreSQL
# ------------------------------------------------------------------------------
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DB=openexploit
POSTGRES_USER=debug
POSTGRES_PASSWORD=debug

The overall structure should look like this:

.envs/
└── .local
   ├── .django
   └── .postgres

Now that you have these files in place, you can build the container with:

docker-compose -f local.yml build

This may take some time.

Once the container is built, you can bring it up with:

docker-compose -f local.yml up

And bring it down with:

docker-compose -f local.yml down

ML Pipeline File Placement

The ML pipeline script requires several files to run. Here's what they are and where to place them.
First, in the base directory of the project, create a folder called models; inside models, create two more directories called model and tokenizer. Then place the saved model folder (the folder containing the config.json and model.safetensors files) inside the models/model/ directory.
Place the saved tokenizer folder (the folder containing special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.txt) inside the models/tokenizer/ directory.
Lastly, the ML pipeline script requires an xpath_queries JSON file. Simply place this file inside the models/ directory.
The structure should look like this:

models/
├── model
│   └── model_V1
│       ├── config.json
│       ├── model.safetensors
├── tokenizer
│   └── tokenizer_V1
│       ├── special_tokens_map.json
│       ├── tokenizer.json
│       ├── tokenizer_config.json
│       ├── vocab.txt
├── xpath_queries_2023-07-03_1.json

DO NOTE that the model.safetensors file is too large for GitHub, so the models/ directory is ignored by git. This means the model, tokenizer, and xpath queries files must be changed manually. If you would like to see an example of this file structure, please see production's /var/docker/openexploitdatabasescraper/models/
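
For reference, the saved model and tokenizer folders are in the standard Hugging Face transformers on-disk format (config.json, model.safetensors, tokenizer files). Below is a minimal sketch of how such artifacts can be loaded; the versioned folder names follow the example tree above, and the use of AutoModel/AutoTokenizer is an assumption, not the pipeline's actual loading code.

# A minimal loading sketch, assuming the Hugging Face transformers library;
# the folder names follow the example tree above and may differ in your checkout.
import json

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("models/model/model_V1")
tokenizer = AutoTokenizer.from_pretrained("models/tokenizer/tokenizer_V1")

# The xpath queries file is plain JSON and can be read directly.
with open("models/xpath_queries_2023-07-03_1.json") as f:
    xpath_queries = json.load(f)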

High Level Overview of Scrapers

The scrapers are made using Scrapy. They take an initial URL and from there crawl through the website's pagination of exploits. On each page, the scraper attempts to gather all of the exploits and their data, saving the info into the Exploit Django model. Each scraper has two modes: an update mode and a full run mode. The update mode is used to update the database with a website's newer entries only, while the full run mode crawls all the exploits the website has to offer; this is done to initially populate the database with a website's exploit archive. To avoid detection, whenever a Scrapy spider loads the next page it does so using Tor via a custom downloader middleware. Tor is also used when requesting an exploit's example file during a full run, but not during an update run, as Tor is not needed then.
More details on each scraper can be found in the spiders' README.
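
Here is a minimal sketch of the crawling pattern described above; it is not one of the real spiders. The spider name, start URL, CSS selectors, and the n_days argument are illustrative, and the real spiders hand their scraped fields to the save_exploit helper described under "Integrating a New Spider" below.

# A minimal sketch of the pagination pattern described above, not one of the
# actual spiders; the start URL and selectors are placeholders.
import scrapy


class ExampleExploitSpider(scrapy.Spider):
    name = "example_exploits"
    start_urls = ["https://example.com/exploits?page=1"]  # hypothetical source

    def __init__(self, n_days=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Update mode: only gather entries newer than n_days; full run otherwise.
        self.n_days = int(n_days) if n_days is not None else None

    def parse(self, response):
        # Gather every exploit listed on the current page.
        for row in response.css("tr.exploit"):
            yield {
                "name": row.css("a::text").get(),
                "source_url": response.urljoin(row.css("a::attr(href)").get()),
            }
        # Follow the pagination; in the real spiders this request is routed
        # through Tor by the custom downloader middleware.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)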

Exploit Model Explanation

To store an exploit and its data, a Django model is used. Aptly named "Exploit", this model can store several attributes of a scraped exploit. Here are all of the model's fields and their descriptions:

  • source: The source name of this exploit. Ex.: GitHub, CXSecurity, ...
  • source_url: The URL of the source.
  • name: The name of the exploit.
  • cve_id: The CVE ID of the exploit.
  • is_repo: Boolean field indicating if the exploit is a repo archive file.
  • date_published: The datetime object of when the exploit was published or uploaded, defaults to None.
  • date_created: The datetime object of when the exploit model was created, autofills when created.
  • author: The author or uploader of the exploit, defaults to None.
  • description: A short description of the exploit, defaults to None.
  • download_failed: Boolean value indicating if a failure to download the example exploit file occurred, defaults to False.
  • example_file: The exploit's example or demonstration file.
  • ignore: A flag indicating whether or not to ignore this exploit when displaying errored exploits in the admin panel.
  • fixed: A flag indicating if this exploit has been fixed or not.

The following model attributes were put in place mainly for the ML pipeline output.

  • vendor: The vendor of the exploit.
  • product: Product information.
  • vul_type: The vulnerability type.
  • risk: The risk level.
  • pub_dates: The publication dates of the exploit.
  • version: The versioning of the exploit.
  • remote_local: Remote/Local information on the exploit.
  • host_info: The host information.
  • poc: The proof of concept.
  • reproduce: Steps to reproduce.
  • impact: The impact of the exploit.

At a minimum each exploit model will have the source, name, is_repo, and date_created filled out. The other attributes are scraped when possible or have a default value.
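
As a rough illustration, a few of the fields above might be declared as follows; this is a sketch, not the project's actual model definition, and the field types, options, and defaults are assumptions.

# A minimal sketch of a handful of the fields listed above; the real Exploit
# model may use different field types, options, and defaults.
from django.db import models


class Exploit(models.Model):
    source = models.CharField(max_length=255)               # e.g. "GitHub", "CXSecurity"
    source_url = models.URLField(null=True, blank=True)
    name = models.CharField(max_length=255)
    cve_id = models.CharField(max_length=64, null=True, blank=True)
    is_repo = models.BooleanField(default=False)
    date_published = models.DateTimeField(null=True, default=None)
    date_created = models.DateTimeField(auto_now_add=True)  # autofills on creation
    download_failed = models.BooleanField(default=False)
    example_file = models.FileField(upload_to="exploits/", null=True, blank=True)  # upload path is an assumption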

Please know that all downloaded exploit examples on the production server can be found in the /var/media/ directory.

Crawl Command Instructions

The ability to run all of the scrapers is provided via the crawl command. This custom Django command allows you to control which scrapers to run (or, more accurately, which scrapers not to run) and whether to do an update run or a full scrape run. A full synopsis of the crawl command's arguments can be viewed by typing ./manage.py crawl --help or python manage.py crawl --help. NOTE: As this is a custom Django command, it must be run through manage.py, which is why ./manage.py precedes the crawl command as seen above.

Currently, there are six scrapers that can be run: the ExploitDB spider, the CXSecurity spider, the Packetstorm spider, the Repo spider, the Metasploit scraper, and the NVD spider.

Below are some example use cases of the crawl command:

  1. Let's say you want to run an update of the CXSecurity exploits, going back 2 days. Here's what the command would look like:
    ./manage.py crawl 2 --exploitdb --packetstorm --repo or ./manage.py crawl 2 -e -p -r
    Remember, the scraper flags (-e, -p, and -r) flag those scrapers NOT to run, since most of the time you'll want to update using all scrapers.
    Speaking of updating all of the sources...
  2. If you want to update all of the sources going back 5 days, you would run:
    ./manage.py crawl 5. If you wanted to scrape the entirety of all sources, you would simply run ./manage.py crawl. By not specifying the max number of days to go back, the scrapers will scrape everything.
  3. Some scrapers (such as CXSecurity) have the option to specify a start and end page for their source's exploit pagination: ./manage.py crawl -s 10 -l 20. This command would result in CXSecurity and Packetstorm scraping pages 10-20; however, the ExploitDB and Repo spiders do not make use of these parameters and would do a full run, as n_days is not provided.
    In this case, it would be good to flag these two scrapers for no run: ./manage.py crawl -s 10 -l 20 -e -r
    Now, only Packetstorm and CXSecurity will run, using the pagination parameters handed to them.

Integrating a New Spider

Below are instructions on how to integrate a new Scrapy spider into the project. But first, it would be a good idea to familiarize yourself with how a Scrapy spider works. It would also be a good idea to look into and reuse logic from the current spiders, such as the update delta, the Tor request for a full run, etc.
Now, to integrate a new spider into the project:

  1. Start by adding your new spider to the same directory where all the other spiders are located.
  2. Once your spider is able to scrape all of the desired info, use the save_exploit helper function to save the scraped exploit info to the database as an Exploit model. This function takes in the various attributes of the Exploit model, assembles them into an Exploit model, and saves it to the database for you.
  3. Lastly, in order for the crawl command to run your new spider, you'll need to add it to the command's script (see the sketch after this list).
    First, import your spider.
    Then, add a no-run flag for your scraper in add_arguments().
    And lastly, add your spider to handle() just like the other scrapers.
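
Here is a minimal sketch of the step 3 changes, assuming the crawl command follows the usual Django management command layout; the spider class, import path, and the "-x/--example" no-run flag are illustrative, not the project's actual code.

# A sketch of wiring a new spider into the crawl command; the import path,
# spider class, and no-run flag names are hypothetical.
from django.core.management.base import BaseCommand
from scrapy.crawler import CrawlerProcess

from openexploit.scrapers.spiders.example_spider import ExampleExploitSpider  # hypothetical path


class Command(BaseCommand):
    help = "Run the exploit scrapers"

    def add_arguments(self, parser):
        # No-run flag: passing it EXCLUDES the spider from the run.
        parser.add_argument("-x", "--example", action="store_true",
                            help="Do NOT run the Example spider")

    def handle(self, *args, **options):
        process = CrawlerProcess()
        if not options["example"]:
            process.crawl(ExampleExploitSpider)
        process.start()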

Creating and Exporting a Database Snapshot

You'll need SSH access to the production server. First, SSH into: openexploit.crc.nd.edu

Change directory to where the code is deployed: /opt/docker/openexploitdatabasescraper

Run the following command to create a new backup of the database:

docker compose -f production.yml exec postgres backup

This will create a new backup (let's call it backup_2023_10_01T12_00_00.sql.gz) in the container's /backups directory.

Next, copy the new file from the docker container to your home directory:

docker cp $(docker compose -f production.yml ps -q postgres):/backups/backup_2023_10_01T12_00_00.sql.gz ~/

Once this is in your home directory, you can secure-copy it to your local machine (in a new terminal):

scp [USERNAME]@openexploit.crc.nd.edu:/home/[USERNAME]/backup_2023_10_01T12_00_00.sql.gz .

Finally, you can copy this to your local docker location and restore it.

NOTE: You'll need to make sure you only bring up the postgres service. If Django has an open connection to the database, the restore could fail!

Bring up only the postgres service with:

docker compose -f local.yml up postgres

Then copy the exported data into your local postgres's /backups folder with:

docker cp backup_2023_10_01T12_00_00.sql.gz $(docker compose -f local.yml ps -q postgres):/backups

And finally, restore it:

docker compose -f local.yml exec postgres restore backup_2023_10_01T12_00_00.sql.gz

Accessing and Using the Exploit's API

There are two ways of accessing the API: through the browser or by requesting an authentication token and accessing the endpoint using that token.

Accessing API through the browser

In order to access the Exploit's API endpoint via the browser, you just have to be logged in. Once logged in, simply head to the /api/exploits/ endpoint.

Accessing the API Through a Token

If you're wanting to access the API through a script or through an API platform such as Postman, this is the way to do it.
First, you need to get an authentication token, to do this you must have an account. To get an authentication token you'll need to make a POST request to /api/authenticate/ with a data payload containing your login credentials in json format. E.g.,

{
    "username": "<username_here>",
    "password": "<password_here>"
}

Provided that your credentials are valid, the response's content will contain your token formatted as:

{
    "token": "<token_key_here>"
}

Using your token, you can now set your authorization as follows:

{
    "Authorization": "Token <token_key_here>"
}

Once set, you will be able to access the /api/exploits/ endpoint.
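
Here is a minimal sketch of that flow from a script, using the requests library; the base URL is a placeholder for wherever the service is hosted.

# A sketch of the token flow described above; BASE_URL is a placeholder.
import requests

BASE_URL = "https://example.com"

# 1. Exchange your credentials for a token.
resp = requests.post(
    f"{BASE_URL}/api/authenticate/",
    json={"username": "<username_here>", "password": "<password_here>"},
)
token = resp.json()["token"]

# 2. Use the token in the Authorization header to query the exploits endpoint.
exploits = requests.get(
    f"{BASE_URL}/api/exploits/",
    headers={"Authorization": f"Token {token}"},
)
print(exploits.json())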

Exploit Endpoint Parameters

There are three parameters you can make use of: page, page_size, and cve_id.
By default, page=1 and page_size=100.
If you are accessing the endpoint through a browser, you'll simply add these onto the URL. Here are a few examples:
Let's say you wanted to access page three of the exploits where each page contains 10 entries:

  • /api/exploits/?page=3&page_size=10

Now let's say you only want exploits whose CVE ID contains the number 2; you would set the cve_id param to 2:

  • /api/exploits/?cve_id=2

If you are accessing the endpoint through a script, you'll add your parameters into a JSON-formatted data payload. Here are a couple of examples:
Here's the equivalent to the first example seen above:

{
    "page": 3,
    "page_size": 10
}

And here's the equivalent to the second example seen above:

{
    "cve_id": 2
}
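
For example, here are both queries from a script using the requests library; this sketch passes the values as query-string parameters, mirroring the browser examples above, rather than as a request body.

# A sketch of the two examples above from a script; BASE_URL and the token
# header are placeholders, as in the earlier authentication sketch.
import requests

BASE_URL = "https://example.com"
HEADERS = {"Authorization": "Token <token_key_here>"}

# Page three with 10 entries per page.
page_three = requests.get(f"{BASE_URL}/api/exploits/",
                          headers=HEADERS, params={"page": 3, "page_size": 10})

# Only exploits whose CVE ID contains the number 2.
cve_filtered = requests.get(f"{BASE_URL}/api/exploits/",
                            headers=HEADERS, params={"cve_id": 2})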

Production Nightly Runs

Nightly pulls are set up as a cron job scheduled under the production root user's crontab. These can be edited by SSH'ing into the production server, elevating to root (with sudo su), and opening the crontab:

crontab -e

You'll find the only entry for running nightly scrapes defined as:

0 0 * * * cd /opt/docker/openexploitdatabasescraper/ && docker compose -f production.yml run --rm django python manage.py crawl --all -n 2 > /var/log/crawl.log 2>&1

This line can be broken down into the following instructions:

  • Launch at midnight every night
  • Change directory to where the code is deployed (/opt/docker/openexploitdatabasescraper/)
  • Run the docker compose command to launch a new Django instance
  • Execute the python manage.py crawl Django command within that container
  • Save all records, regardless of their CVE-ID status (--all)
  • Ignore the NVD scraper (-n). NOTE: at the time of this writing, the model was still buggy, causing this crawler to crash. Removing the -n flag from this command will enable the NVD scraper during nightly scrapes.
  • Look for vulnerabilities published within the past 2 days only (2); this allows us to get only recently published vulnerabilities, keeping the run time of the crawlers to a minimum
  • Dump any output to /var/log/crawl.log on the host (not inside the Docker container), including any errors (2>&1)

Disabled NVD

As mentioned above, at the time of this writing, the NVD scraper was still buggy due to model calling errors. As a result, this spider is disabled on the nightly runs. To re-enable this spider, remove the -n flag in the cron job.
