First, you will need to set up your .envs/ folder. Create an .envs/ directory in the project's base directory, and within it create another directory: .local/. Inside the .local/ directory, create two files: .django and .postgres. Inside the .django file, place the following:
# General
# ------------------------------------------------------------------------------
USE_DOCKER=yes
IPYTHONDIR=/app/.ipython
# Redis
# ------------------------------------------------------------------------------
REDIS_URL=redis://redis:6379/0
# Celery
# ------------------------------------------------------------------------------
# Flower
CELERY_FLOWER_USER=debug
CELERY_FLOWER_PASSWORD=debug
TOR_CONTROLLER_PASSWORD=debug
DJANGO_ACCOUNT_ALLOW_REGISTRATION=False
And inside of the .postgres file, place this:
# PostgreSQL
# ------------------------------------------------------------------------------
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DB=openexploit
POSTGRES_USER=debug
POSTGRES_PASSWORD=debug
The overall structure should look like this:
.envs/
└── .local
    ├── .django
    └── .postgres
Now that you have these files in place, you can build the container with:
docker-compose -f local.yml build
This may take some time.
Once the container is built, you can bring it up with:
docker-compose -f local.yml up
And bring it down with:
docker-compose -f local.yml down
The ML pipeline script requires several files to run. Here's what they are and where to place them.
First, at the base directory of the project, create a folder called models. Inside models, create two more directories: model and tokenizer.
Now, inside of the models/model/ dir, place the saved model folder (the folder containing the config.json and model.safetensors files).
Place the saved tokenizer folder (the folder containing the files special_tokens_map.json, tokenizer_config.json, tokenizer.json, and vocab.txt) inside of the models/tokenizer/ dir.
And lastly, the ML pipeline script requires an xpath_queries JSON file. Simply place this file inside of the models/ directory.
The structure should look like this:
models/
├── model
│   └── model_V1
│       ├── config.json
│       └── model.safetensors
├── tokenizer
│   └── tokenizer_V1
│       ├── special_tokens_map.json
│       ├── tokenizer.json
│       ├── tokenizer_config.json
│       └── vocab.txt
└── xpath_queries_2023-07-03_1.json
Do note that the model.safetensors file is too big for GitHub, therefore the models/ directory is set to be ignored by git. This means changing the model, tokenizer, and xpath queries files must be done manually.
If you would like to see an example of this file structure, please see production's /var/docker/openexploitdatabasescraper/models/ directory.
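For orientation, the layout above matches what the Hugging Face transformers library's save_pretrained() produces, which is what the config.json/model.safetensors/tokenizer.json file names suggest. Below is a minimal sketch of how these artifacts might be loaded; the paths and the Auto* classes are assumptions, so check the actual pipeline script for the real loading code.

# Sketch only: loading the saved model, tokenizer, and xpath queries.
# Paths and the Auto* classes are assumptions based on the layout above.
import json

from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "models/model/model_V1"              # config.json, model.safetensors
TOKENIZER_DIR = "models/tokenizer/tokenizer_V1"  # tokenizer.json, vocab.txt, ...
XPATH_QUERIES = "models/xpath_queries_2023-07-03_1.json"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
model = AutoModel.from_pretrained(MODEL_DIR)

with open(XPATH_QUERIES) as fh:
    xpath_queries = json.load(fh)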
The scrapers are made using Scrapy. They take an initial URL and from there crawl through the website's pagination of exploits. On each page, the scraper will attempt to gather all of the exploits and their data, saving the info into the Exploit Django model. Each scraper has two modes: an update mode and a full run mode. The update mode is used for just updating the database with a website's newer entries, while the full run mode crawls all the exploits the website has to offer; this is done to initially populate the database with a website's exploit archive. To avoid detection, whenever the Scrapy spiders load the next page they do so using Tor via a custom downloader middleware. Tor is also used when requesting an exploit's example file during a full run, but not during an update run, as Tor is not needed then.
More details on each scraper can be found in the spiders' README.
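To illustrate the Tor routing pattern in general terms, here is a minimal sketch of a Scrapy downloader middleware that sends selected requests through a local proxy in front of Tor. The class name, proxy address, and the use_tor request flag are illustrative assumptions; refer to the project's actual middleware for the real implementation.

# Illustrative sketch of a Tor-routing downloader middleware.
# The proxy address and the request flag are assumptions.
class TorProxyMiddleware:
    TOR_HTTP_PROXY = "http://127.0.0.1:8118"  # e.g. an HTTP proxy forwarding to Tor

    def process_request(self, request, spider):
        # Only route through Tor when the spider asked for it
        # (e.g. loading the next pagination page, or a full-run file download).
        if request.meta.get("use_tor"):
            request.meta["proxy"] = self.TOR_HTTP_PROXY
        return None  # let Scrapy continue processing the request

# Enabled in the Scrapy settings via something like:
# DOWNLOADER_MIDDLEWARES = {"scraper.middlewares.TorProxyMiddleware": 543}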
To store an exploit and its data, a Django model is used. Aptly named Exploit, this model can store several attributes of a scraped exploit. Here are all of the model's fields and their descriptions:
- source: The source name of this exploit. Ex.: GitHub, CXSecurity, ...
- source_url: The URL of the source.
- name: The name of the exploit.
- cve_id: The CVE ID of the exploit.
- is_repo: Boolean field indicating if the exploit is a repo archive file.
- date_published: The datetime object of when the exploit was published or uploaded, defaults to None.
- date_created: The datetime object of when the exploit model was created, autofills when created.
- author: The author or uploader of the exploit, defaults to None.
- description: A short description of the exploit, defaults to None.
- download_failed: Boolean value indicating if a failure to download the example exploit file occurred, defaults to False.
- example_file: The exploit's example or demonstration file.
- ignore: A flag indicating whether or not to ignore this exploit when displaying errored exploits in the admin panel.
- fixed: A flag indicating if this exploit has been fixed or not.
The following model attributes were put in place mainly for the ML pipeline output.
- vendor: The vendor of the exploit.
- product: Product information.
- vul_type: The vulnerability type.
- risk: The risk level.
- pub_dates: The publication dates of the exploit.
- version: The versioning of the exploit.
- remote_local: Remote/Local information on the exploit.
- host_info: The host information.
- poc: The proof of concept.
- reproduce: Steps to reproduce.
- impact: The impact of the exploit.
At a minimum each exploit model will have the source, name, is_repo, and date_created filled out. The other attributes are scraped when possible or have a default value.
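For orientation, the fields above translate into a Django model roughly like the sketch below. The field types, lengths, and defaults shown here are assumptions for illustration; the authoritative definition lives in the app's models.py.

# Rough sketch of the Exploit model; field types and options are assumptions.
from django.db import models

class Exploit(models.Model):
    source = models.CharField(max_length=255)             # e.g. GitHub, CXSecurity
    source_url = models.URLField(null=True, blank=True)
    name = models.CharField(max_length=255)
    cve_id = models.CharField(max_length=64, null=True, blank=True)
    is_repo = models.BooleanField(default=False)
    date_published = models.DateTimeField(null=True, default=None)
    date_created = models.DateTimeField(auto_now_add=True)
    author = models.CharField(max_length=255, null=True, default=None)
    description = models.TextField(null=True, default=None)
    download_failed = models.BooleanField(default=False)
    example_file = models.FileField(null=True, blank=True)
    ignore = models.BooleanField(default=False)
    fixed = models.BooleanField(default=False)

    # Fields populated mainly by the ML pipeline output
    vendor = models.TextField(null=True, blank=True)
    product = models.TextField(null=True, blank=True)
    vul_type = models.TextField(null=True, blank=True)
    risk = models.TextField(null=True, blank=True)
    pub_dates = models.TextField(null=True, blank=True)
    version = models.TextField(null=True, blank=True)
    remote_local = models.TextField(null=True, blank=True)
    host_info = models.TextField(null=True, blank=True)
    poc = models.TextField(null=True, blank=True)
    reproduce = models.TextField(null=True, blank=True)
    impact = models.TextField(null=True, blank=True)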
Please know that all downloaded exploit examples on the production server can be found in the /var/media/ directory.
The ability to run all of the scrapers has been made possible via the crawl command. This custom Django command allows you to control which scrapers to run (or, more accurately, which scrapers not to run) and whether to do an update run or a full scrape run. A full synopsis of the crawl command's arguments can be viewed by typing ./manage.py crawl --help or python manage.py crawl --help.
NOTE: As this is a custom Django command, it must be run through manage.py; this is why ./manage.py precedes the crawl command as seen above.
Currently, there are six scrapers that can be run: the ExploitDB spider, the CXSecurity spider, the Packetstorm spider, the Repo spider, the Metasploit scraper, and the NVD spider.
Below are some use case examples of the handy dandy crawl command:
- Let's say you want to run an update of the CXSecurity exploits, going back 2 days. Here's what the command would look like: ./manage.py crawl 2 --exploitdb --packetstorm --repo or ./manage.py crawl 2 -e -p -r. Remember, the scraper flags (-e, -p, and -r) flag those scrapers to NOT run, as most of the time you'll want to update using all scrapers. Speaking of updating all of the sources...
- If you want to update all of the sources going back 5 days, you would run ./manage.py crawl 5, or if you wanted to scrape the entirety of all sources you would run ./manage.py crawl. By not specifying the max number of days to go back, the scrapers will simply scrape everything.
- Some scrapers (such as CXSecurity) have the option to specify a start and end page for their source's exploit pagination: ./manage.py crawl -s 10 -l 20. This command would result in CXSecurity and Packetstorm scraping pages 10-20; however, the ExploitDB and Repo spiders do not make use of these parameters and would do a full run as n_days is not provided. In this case, it would be good to flag these two scrapers for no run: ./manage.py crawl -s 10 -l 20 -e -r. Now only Packetstorm and CXSecurity will run, using the pagination parameters handed to them.
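If you ever need to trigger a crawl programmatically (from a Django shell, a Celery task, etc.) rather than from the command line, Django's call_command can be used. A minimal sketch, assuming the same arguments shown in the examples above:

# Minimal sketch: running the crawl command from Python instead of the CLI.
# Argument handling is assumed to mirror the CLI examples above.
from django.core.management import call_command

# Equivalent of: ./manage.py crawl 2 -e -p -r
call_command("crawl", "2", "--exploitdb", "--packetstorm", "--repo")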
Below are instructions on how to integrate a new Scrapy spider into the project. But first, it would be a good idea to familiarize yourself with how a Scrapy spider works. It would also be a good idea to look into and re-use various pieces of logic from the current spiders, such as the update delta, the Tor request for a full run, etc.
Now, to integrate a new spider into the project:
- Start by adding your new spider to the same directory where all the other spiders are located.
- Once your spider is able to scrape all of the desired info, use the save_exploit helper function to save the scraped exploit info to the database as an Exploit model. This function takes in the various attributes of the Exploit model, assembles them into an Exploit model instance, and saves it to the database for you.
- Lastly, in order for the crawl command to run your new spider, you'll need to add it to the command's script. First, import your spider. Next, add a no-run flag for your scraper in add_arguments(). And lastly, add your spider to handle() just like the other scrapers. A rough sketch of these steps is shown below.
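Here is a rough sketch of those steps. The spider name, page selectors, and the import path and keyword arguments of save_exploit are assumptions; mirror one of the existing spiders for the real details.

# Illustrative sketch of a new spider that saves results via save_exploit.
# The import path, selectors, and keyword names are assumptions.
import scrapy

from openexploit.utils import save_exploit  # hypothetical location; adjust to the project layout


class NewSourceSpider(scrapy.Spider):
    name = "newsource"
    start_urls = ["https://example.com/exploits"]  # hypothetical source

    def parse(self, response):
        for row in response.css("div.exploit"):  # hypothetical page structure
            # save_exploit assembles the attributes into an Exploit model
            # and saves it to the database.
            save_exploit(
                source="NewSource",
                source_url=response.urljoin(row.css("a::attr(href)").get()),
                name=row.css("a::text").get(),
                cve_id=row.css("span.cve::text").get(),
            )

# Then, in the crawl command's script (sketch):
#   - import NewSourceSpider,
#   - add a no-run flag in add_arguments(), e.g.
#       parser.add_argument("--newsource", action="store_true",
#                           help="Do NOT run the NewSource spider")
#   - and schedule NewSourceSpider in handle() unless that flag is set,
#     just like the other scrapers.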
You'll need SSH access on the production server, and will need to first SSH into: openexploit.crc.nd.edu
Change directory to where the code is deployed: /opt/docker/openexploitdatabasescraper
Run the following command to create a new backup of the database:
docker compose -f production.yml exec postgres backup
This will create a new backup (let's call it backup_2023_10_01T12_00_00.sql.gz) in the container's /backups directory.
Next, copy the new file from the docker container to your home directory:
docker cp $(docker compose -f production.yml ps -q postgres):/backups/backup_2023_10_01T12_00_00.sql.gz ~/
Once this is in your home directory, you can secure-copy it to your local machine (in a new terminal):
scp [USERNAME]@openexploit.crc.nd.edu:/home/[USERNAME]/backup_2023_10_01T12_00_00.sql.gz .
Finally, you can copy this to your local docker location and restore it.
NOTE: You'll need to make sure you only bring up the postgres service. If Django has an open connection to the database, the restore could fail!
Bring up only the postgres service with:
docker compose -f local.yml up postgres
Then copy the exported data into your local postgres's /backups folder with:
docker cp backup_2023_10_01T12_00_00.sql.gz $(docker compose -f local.yml ps -q postgres):/backups
And finally, restore it:
docker compose -f local.yml exec postgres restore backup_2023_10_01T12_00_00.sql.gz
There are two ways of accessing the API: through the browser or by requesting an authentication token and accessing the endpoint using that token.
In order to access the Exploit API endpoint via the browser, you just have to be logged in. Once logged in, simply head to the /api/exploits/ endpoint.
If you want to access the API through a script or through an API platform such as Postman, this is the way to do it.
First, you need to get an authentication token; to do this you must have an account.
To get an authentication token, you'll need to make a POST request to /api/authenticate/ with a data payload containing your login credentials in JSON format. E.g.,
{
"username": "<username_here>",
"password": "<password_here>"
}
Provided that your credentials are valid, the response's content will contain your token formatted as:
{
"token": "<token_key_here>"
}
Using your token, you can now set your authorization as follows:
{
"Authorization": "Token <token_key_here>"
}
Once set, you will be able to access the /api/exploits/ endpoint.
There are three parameters you can make use of: page, page_size, and cve_id. By default, page=1 and page_size=100.
If you are accessing the endpoint through a browser, you'll simply add these onto the URL. Here are a few examples:
Let's say you wanted to access page three of the exploits where each page contains 10 entries:
/api/exploits/?page=3&page_size=10
Now let's say you only want exploits whose CVE ID contains the number 2; you would set the cve_id param to 2:
/api/exploits/?cve_id=2
If you are accessing the endpoint through a script, you'll add your parameters into a JSON-formatted data payload. Here are a couple of examples:
Here's the equivalent to the first example seen above:
{
"page": 3,
"page_size": 10
}
And here's the equivalent to the second example seen above:
{
"cve_id": 2
}
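Putting the token flow together, here is a minimal scripted example using the Python requests library. The base URL is a placeholder; the endpoints and parameters are the ones described above, with the parameters shown here as query parameters (equivalent to the browser URL examples).

# Minimal sketch of scripted API access; BASE_URL is a placeholder.
import requests

BASE_URL = "https://your-deployment-here"  # placeholder; point at your instance

# 1. Get an authentication token.
resp = requests.post(
    f"{BASE_URL}/api/authenticate/",
    json={"username": "<username_here>", "password": "<password_here>"},
)
token = resp.json()["token"]

# 2. Query the exploits endpoint with the token and optional parameters.
resp = requests.get(
    f"{BASE_URL}/api/exploits/",
    headers={"Authorization": f"Token {token}"},
    params={"page": 3, "page_size": 10, "cve_id": 2},
)
print(resp.json())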
Nightly pulls are set up as a cron job scheduled under the production root user's crontab. These can be edited by SSH'ing into the production server, elevating to root (with sudo su), and opening the crontab:
crontab -e
You'll find the only entry for running nightly scrapes defined as:
0 0 * * * cd /opt/docker/openexploitdatabasescraper/ && docker compose -f production.yml run --rm django python manage.py crawl --all -n 2 > /var/log/crawl.log 2>&1
This line can be broken down into the following instructions:
- Launch at midnight every night
- Change directory to where the code is deployed (/opt/docker/openexploitdatabasescraper/)
- Run the docker compose command to launch a new Django instance
- Execute the python manage.py crawl Django command within that container
- Save all records, regardless of their CVE-ID status (--all)
- Ignore the NVD scraper (-n). NOTE: at the time of this writing, the model was still buggy, causing this crawler to crash. Removing the -n flag from this command will enable the NVD scraper during nightly scrapes.
- Look for vulnerabilities published within the past 2 days only (2) -- this allows us to get only recently published vulnerabilities, keeping the run time of the crawlers to a minimum
- Dump any output to /var/log/crawl.log on the host (not the internal docker container), including any errors (2>&1)
As mentioned above, at the time of this writing, the NVD scraper was still buggy due to model calling errors. As a result, this spider is disabled on the nightly runs. To re-enable this spider, remove the -n flag in the cron job.