This project provides pipelines and tools for training NLP models that predict SNI codes.
The training toolchain can:
- Create labeled training data by:
  - Polling the national statistics agency (SCB)'s API.
  - Matching a URL to each company found (if the company has a website).
  - Automatically scraping the websites of these companies (see the sketch after this list).
  - Preprocessing the data with heuristic methods.
- Divide the data into training, validation, and test sets.
- Train a spaCy model using the datasets.
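The scraping and extraction steps boil down to fetching each company's site and keeping only the human-readable text. Below is a minimal sketch of that idea, assuming `requests` and `BeautifulSoup` as dependencies; the function name `fetch_visible_text` is hypothetical and not part of the project.

```python
# Minimal sketch of fetching a company website and extracting visible text.
# The libraries (requests, BeautifulSoup) and the function name are illustrative
# assumptions, not the project's actual scraper.
import requests
from bs4 import BeautifulSoup


def fetch_visible_text(url: str, timeout: int = 10) -> str:
    """Download a page and return its visible text, stripped of markup."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style tags so only human-readable content remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())
```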
The evaluation toolchain can:
- Scrape a single website.
- Preprocess the scraped data.
- Use a trained model to predict the company's SNI code.
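For the prediction step, a trained spaCy text-classification model exposes its label scores through `doc.cats`. A minimal sketch, assuming a model saved to a hypothetical `training/model-best` directory:

```python
# Minimal sketch of predicting an SNI code with a trained spaCy textcat model.
# The model path and example text are assumptions; the real path depends on
# the training output configured in project.yml.
import spacy

nlp = spacy.load("training/model-best")  # hypothetical output directory
website_text = "Vi utvecklar programvara och erbjuder IT-konsulttjänster."
doc = nlp(website_text)

# doc.cats maps each SNI label to a score; take the highest-scoring one.
predicted_sni, score = max(doc.cats.items(), key=lambda item: item[1])
print(f"Predicted SNI code: {predicted_sni} ({score:.2f})")
```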
The following commands are defined by the project. Commands are only re-run if their inputs have changed.
| Command | Description | Requirements |
| --- | --- | --- |
| `SCB` | Get data from SCB | SCB FDB API credentials and certificate & MongoDB instance |
| `google` | Fill the DB with a matching URL for each company using the Google search API | Google Custom Search JSON API credentials, a Google Programmable Search Engine & MongoDB instance |
| `scrape` | Scrape websites | |
| `extract` | Extract the valuable data from the scraped websites | MongoDB instance |
| `divide` | Divide the dataset into training and validation sets | MongoDB instance |
| `preprocess` | Convert the data to spaCy's binary format | MongoDB instance |
| `train-models` | Train a text classification model | MongoDB instance |
| `evaluate-accuracy-prod` | Evaluate the prod model for accuracy and export metrics | |
| `evaluate-speed-prod` | Evaluate the prod model for speed and export metrics | |
| `evaluate-accuracy-dev` | Evaluate the dev model for accuracy and export metrics | |
| `evaluate-speed-dev` | Evaluate the dev model for speed and export metrics | |
| `predict` | Predict the SNI code of a company based on its website data | |
| `eval-custom` | Custom evaluation of the model | |
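As background for the `preprocess` command, converting labeled examples into spaCy's binary format is typically done with `DocBin`. A minimal sketch with made-up example data (the real pipeline reads its data from MongoDB, and the output path here is illustrative):

```python
# Sketch of converting labeled (text, SNI code) pairs to spaCy's binary format.
# The example data and output path are illustrative assumptions.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("sv")  # blank Swedish pipeline, used only for tokenization
examples = [
    ("Vi säljer dagligvaror i hela landet.", "47"),
    ("Byggfirma med fokus på renoveringar.", "41"),
]

doc_bin = DocBin()
for text, sni_code in examples:
    doc = nlp.make_doc(text)
    doc.cats = {sni_code: 1.0}  # textcat expects a label-to-score mapping
    doc_bin.add(doc)

doc_bin.to_disk("corpus/train.spacy")  # hypothetical output path
```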
The following workflows are defined by the project. They can be executed using `spacy project run [workflow]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `evaluate-dev` | `evaluate-accuracy-dev` |
| `evaluate-prod` | `evaluate-accuracy-prod` |
| `all` | `SCB` → `google` → `scrape` → `extract` → `divide` → `preprocess` → `train-models` |
| `fetch` | `SCB` → `google` → `scrape` |
| `train` | `extract` → `divide` → `preprocess` → `train-models` |
| `test_without_training` | `extract` |
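Workflows are normally started from the shell with `spacy project run [workflow]`. If you prefer to drive them from a script instead, a standard-library sketch such as the following works:

```python
# Sketch of running project workflows from a script instead of the shell.
# The workflow names come from project.yml; "fetch" and "train" are two of them.
import subprocess

for workflow in ("fetch", "train"):
    subprocess.run(
        ["python", "-m", "spacy", "project", "run", workflow],
        check=True,  # stop if a workflow step fails
    )
```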
- Install the dependencies with `pip install -r requirements.txt` (preferably inside a Python virtual environment).
- Create a new Google Programmable Search Engine, and add all URLs from `assets/google_search_blacklist.txt` to the engine blacklist.
- Create a copy of `.env.example` called `.env` in the root folder, and fill in the fields:
  - `GOOGLE_SEARCH_API_KEY` comes from the Google Custom Search JSON API credentials.
  - `GOOGLE_SEARCH_ENGINE_ID` comes from the Google Programmable Search Engine.
  - `SCB_API_USER` & `SCB_API_PASS` come from the SCB account you are issued when signing a contract with SCB for SCB FDB.
- Copy the SCB certificate into the root folder, and rename it to `key.pfx`.
- Run the program using `spacy project run <workflow name>`, where `<workflow name>` is one of the workflows from `project.yml` (e.g. `all`, `fetch`, `train`, etc.).
- You can also create your own workflows by giving them a name and a list of commands.
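Before starting a workflow it can be useful to confirm the setup above is complete. A hypothetical sanity check (not part of the project), assuming the `.env` values have been exported to the environment:

```python
# Hypothetical sanity check for the setup described above; not part of the project.
# Assumes the .env values have been exported to the environment.
import os
from pathlib import Path

REQUIRED_VARS = [
    "GOOGLE_SEARCH_API_KEY",
    "GOOGLE_SEARCH_ENGINE_ID",
    "SCB_API_USER",
    "SCB_API_PASS",
]

missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
if not Path("key.pfx").exists():
    raise SystemExit("key.pfx not found in the project root.")
print("Setup looks complete.")
```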
NLP/
├─ adapters/ Used to abstract communication between classes, databases and files
├─ assets/ Blacklists and whitelists (.txt and .json)
├─ aux_functions/ Auxiliary functions
├─ classes/ Single-purpose classes
├─ configs/ spaCy config files
├─ pipeline/ Pipeline runner scripts
├─ tests/
├─ UML/