
NER for cybersecurity

Report

See report

Replicate

Experiments were performed using conda on Linux with an Intel CPU. To replicate them, run:

# ensure you have installed conda, initialized it, and sourced your .bashrc
conda env create -n nlp-cyber-ner -f envs/prod_environment.yml
conda activate nlp-cyber-ner

# For the BERT Baseline and LST-NER models (separate environment)
conda env create -f envs/bert_lstner_requirements.yaml
conda activate nlp-cyber-ner-bert-lstner

All experiments were run with a fixed random seed.

Experiments and their associated commits are tracked on our online MLflow server: MLflow Server (DAGsHub).

NLP-Cyber-NER Experiments

This project provides scripts to execute a variety of NER experiments for cybersecurity datasets. The main experiment types include:

  • Combined Dataset Model: Trains and evaluates a model on the union of all datasets.
  • Cross Dataset Model: Trains and evaluates models across different datasets (e.g., train on one, evaluate on another).
  • Multihead (Tokenmodel) Experiments: Trains models with multiple heads for different datasets, supporting various architectural variants (e.g., tied/untied embeddings and LSTMs).
  • LST-NER Model: Cross-domain NER using label structure transfer with graph neural networks and optimal transport.
  • BERT Baseline Model: Standard BERT NER fine-tuning for baseline comparison.

Run the experiments with:

python nlp_cyber_ner/modeling/cross_dataset_model.py
python nlp_cyber_ner/modeling/combined_dataset_model.py
python nlp_cyber_ner/modeling/*tokenmodel*.py  # the multihead (tokenmodel) variants

# LST-NER and BERT Baseline (require the separate environment created above)
# Before running, update the dataset paths in the configuration section of each script
# to match your desired dataset.
python nlp_cyber_ner/modeling/train_lst_ner.py
python nlp_cyber_ner/modeling/train_bert_baseline.py

Experiment Tracking with MLflow

All experiments are automatically logged to MLflow. You can set the environment variable MLFLOW_TRACKING_URI to log results to an external MLflow tracking server. Otherwise, results are logged locally by default.
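For example, to send runs to a remote tracking server such as the DAGsHub one, export the standard MLflow environment variables before launching an experiment. The URI and credential values below are placeholders, not the project's actual settings:

export MLFLOW_TRACKING_URI="https://dagshub.com/<user>/NLP-Cyber-NER.mlflow"  # placeholder URI
export MLFLOW_TRACKING_USERNAME="<your-dagshub-username>"                      # placeholder
export MLFLOW_TRACKING_PASSWORD="<your-dagshub-token>"                         # placeholder
python nlp_cyber_ner/modeling/combined_dataset_model.py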

After running experiments, you can launch the MLflow UI on your local machine to browse results:

mlflow ui

This will start a web server where you can explore experiment runs, metrics, and artifacts.
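If the runs were logged locally (MLflow's default ./mlruns directory), you can also point the UI at that store and choose a port explicitly:

mlflow ui --backend-store-uri ./mlruns --port 5000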

Looking up predictions

The cross-dataset, combined-dataset, LST-NER, and BERT Baseline models all write their predictions to the artifacts/predictions folder. These can, for example, be compared against the ground truth using our lookup tools, lookup.html and lookup2.html.
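One way to use the lookup pages is to serve the repository over HTTP so the pages can fetch the prediction files; this sketch assumes lookup.html and lookup2.html sit in the repository root and read files relative to it:

# serve the repo locally, then open http://localhost:8000/lookup.html in a browser
python -m http.server 8000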

Development

Development was performed with venv; the required packages are listed in dev_requirements.txt.
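A typical setup for the development environment, assuming dev_requirements.txt is at the repository root:

python -m venv .venv
source .venv/bin/activate
pip install -r dev_requirements.txt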

Cite

Please cite this work if you use it.

About

NER models for cybersecurity, built from several different available datasets.
