This project provides pipelines and tools for training NLP models that predict SNI codes.
The training toolchain can:
- Create labeled training data by:
  - Polling the national statistics agency (SCB)'s API.
  - Matching a URL to each company found (if the company has a website).
  - Automatically scraping the websites of these companies (see the sketch after this list).
  - Preprocessing the data with heuristic methods.
- Divide the data into training, validation, and test sets.
- Train a spaCy model using the datasets.
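The scraping and extraction steps boil down to fetching each company's site and keeping only the human-readable text. Below is a minimal sketch of that idea, assuming `requests` and `BeautifulSoup` as dependencies; the function name `fetch_visible_text` is hypothetical and not part of the project.

```python
# Minimal sketch of fetching a company website and extracting visible text.
# The libraries (requests, BeautifulSoup) and the function name are illustrative
# assumptions, not the project's actual scraper.
import requests
from bs4 import BeautifulSoup


def fetch_visible_text(url: str, timeout: int = 10) -> str:
    """Download a page and return its visible text, stripped of markup."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style tags so only human-readable content remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())
```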
The evaluation toolchain can:
- Scrape a single website.
- Preprocess the scraped data.
- Use a trained model to predict the company's SNI code.
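For the prediction step, a trained spaCy text-classification model exposes its label scores through `doc.cats`. A minimal sketch, assuming a model saved to a hypothetical `training/model-best` directory:

```python
# Minimal sketch of predicting an SNI code with a trained spaCy textcat model.
# The model path and example text are assumptions; the real path depends on
# the training output configured in project.yml.
import spacy

nlp = spacy.load("training/model-best")  # hypothetical output directory
website_text = "Vi utvecklar programvara och erbjuder IT-konsulttjänster."
doc = nlp(website_text)

# doc.cats maps each SNI label to a score; take the highest-scoring one.
predicted_sni, score = max(doc.cats.items(), key=lambda item: item[1])
print(f"Predicted SNI code: {predicted_sni} ({score:.2f})")
```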
The following commands are defined by the project. Commands are only re-run if their inputs have changed.
| Command | Description | Requirements |
| --- | --- | --- |
| `SCB` | Get data from SCB | SCB FDB API credentials and certificate & MongoDB instance |
| `google` | Fill the DB with a matching URL for each company using the Google search API | Google Custom Search JSON API credentials, a Google Programmable Search Engine & MongoDB instance |
| `scrape` | Scrape websites | |
| `extract` | Extract the valuable data from the scraped websites | MongoDB instance |
| `divide` | Divide the dataset into training and validation sets | MongoDB instance |
| `preprocess` | Convert the data to spaCy's binary format | MongoDB instance |
| `train-models` | Train a text classification model | MongoDB instance |
| `evaluate-accuracy-prod` | Evaluate the prod model for accuracy and export metrics | |
| `evaluate-speed-prod` | Evaluate the prod model for speed and export metrics | |
| `evaluate-accuracy-dev` | Evaluate the dev model for accuracy and export metrics | |
| `evaluate-speed-dev` | Evaluate the dev model for speed and export metrics | |
| `predict` | Predict the SNI code of a company based on its website data | |
| `eval-custom` | Custom evaluation of the model | |
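As background for the `preprocess` command, converting labeled examples into spaCy's binary format is typically done with `DocBin`. A minimal sketch with made-up example data (the real pipeline reads its data from MongoDB, and the output path here is illustrative):

```python
# Sketch of converting labeled (text, SNI code) pairs to spaCy's binary format.
# The example data and output path are illustrative assumptions.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("sv")  # blank Swedish pipeline, used only for tokenization
examples = [
    ("Vi säljer dagligvaror i hela landet.", "47"),
    ("Byggfirma med fokus på renoveringar.", "41"),
]

doc_bin = DocBin()
for text, sni_code in examples:
    doc = nlp.make_doc(text)
    doc.cats = {sni_code: 1.0}  # textcat expects a label-to-score mapping
    doc_bin.add(doc)

doc_bin.to_disk("corpus/train.spacy")  # hypothetical output path
```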
The following workflows are defined by the project. They can be executed using `spacy project run [workflow]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `evaluate-dev` | `evaluate-accuracy-dev` |
| `evaluate-prod` | `evaluate-accuracy-prod` |
| `all` | `SCB` → `google` → `scrape` → `extract` → `divide` → `preprocess` → `train-models` |
| `fetch` | `SCB` → `google` → `scrape` |
| `train` | `extract` → `divide` → `preprocess` → `train-models` |
| `test_without_training` | `extract` |
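Workflows are normally started from the shell with `spacy project run [workflow]`. If you prefer to drive them from a script instead, a standard-library sketch such as the following works:

```python
# Sketch of running project workflows from a script instead of the shell.
# The workflow names come from project.yml; "fetch" and "train" are two of them.
import subprocess

for workflow in ("fetch", "train"):
    subprocess.run(
        ["python", "-m", "spacy", "project", "run", workflow],
        check=True,  # stop if a workflow step fails
    )
```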
- Install the dependencies with `pip install -r requirements.txt` (preferably inside a Python virtual environment).
- Create a new Google Programmable Search Engine, and add all URLs from `assets/google_search_blacklist.txt` to the engine blacklist.
- Create a copy of `.env.example` called `.env` in the root folder, and fill in the fields:
  - `GOOGLE_SEARCH_API_KEY` comes from the Google Custom Search JSON API credentials.
  - `GOOGLE_SEARCH_ENGINE_ID` comes from the Google Programmable Search Engine.
  - `SCB_API_USER` & `SCB_API_PASS` come from the SCB account you are issued when signing a contract with SCB for SCB FDB.
- Copy the SCB certificate into the root folder, and rename it to `key.pfx`.
- Run the program using `spacy project run <workflow name>`, where `<workflow name>` is one of the workflows from `project.yml` (e.g. `all`, `fetch`, `train`, etc.).
- You can also create your own workflows by giving them a name and a list of commands.
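Before starting a workflow it can be useful to confirm the setup above is complete. A hypothetical sanity check (not part of the project), assuming the `.env` values have been exported to the environment:

```python
# Hypothetical sanity check for the setup described above; not part of the project.
# Assumes the .env values have been exported to the environment.
import os
from pathlib import Path

REQUIRED_VARS = [
    "GOOGLE_SEARCH_API_KEY",
    "GOOGLE_SEARCH_ENGINE_ID",
    "SCB_API_USER",
    "SCB_API_PASS",
]

missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
if not Path("key.pfx").exists():
    raise SystemExit("key.pfx not found in the project root.")
print("Setup looks complete.")
```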
NLP/
├─ adapters/ Used to abstract communication between classes, databases and files
├─ assets/ Blacklists and whitelists (.txt and .json)
├─ aux_functions/ Auxiliary functions
├─ classes/ Single-purpose classes
├─ configs/ spaCy config files
├─ pipeline/ Pipeline runner scripts
├─ tests/
├─ UML/