Consider Including New Open Cybersecurity LLM & Datasets #45

**Hi,**  

We appreciate the author's dedicated effort in compiling research on cybersecurity LLMs, which benefits the entire field!


We hope the author will consider incorporating our latest work 🤗. We have open-sourced the **first** large-scale cybersecurity pretraining dataset. We also provide specialized datasets for cybersecurity instruction fine-tuning (IFT) and for distilling reasoning with reflection, along with the cybersecurity LLMs trained on them.  

📄 **Paper:** [https://arxiv.org/abs/2502.11191](https://arxiv.org/abs/2502.11191)  
🤗 **Hugging Face:** [https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243](https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243)  

### **What We Did**
#### **Datasets:**
- **Primus-Seed:** Manually collected high-quality cybersecurity texts, including MITRE, Wikipedia, blogs, books, etc.  
- **Primus-FineWeb:** 2.57B tokens of cybersecurity text selected from FineWeb (a refined version of Common Crawl) using a cybersecurity text classifier that we trained.  
- **Primus-Instruct:** Includes approximately 1K QA pairs covering common cybersecurity business scenarios.  
- **Primus-Reasoning:** Includes reasoning and reflection data generated by o1-preview on cybersecurity tasks for distillation.  
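
To illustrate the kind of classifier-based filtering used for _Primus-FineWeb_, here is a minimal sketch. It is not our actual pipeline: the keyword scorer `toy_security_score`, the `SECURITY_TERMS` list, the `filter_corpus` helper, and the threshold are all hypothetical stand-ins for a trained classifier, chosen only to show the shape of the filtering step.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical keyword list; a toy stand-in for a trained text classifier.
SECURITY_TERMS = {"malware", "exploit", "vulnerability", "phishing", "ransomware", "cve"}


def toy_security_score(text: str) -> float:
    """Return the fraction of tokens that are security terms (illustrative only)."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in SECURITY_TERMS)
    return hits / len(tokens)


def filter_corpus(
    docs: Iterable[str],
    score_fn: Callable[[str], float] = toy_security_score,
    threshold: float = 0.1,  # assumed cutoff, not a value from the paper
) -> Iterator[str]:
    """Yield only the documents whose score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


sample_docs = [
    "The ransomware used a known exploit to spread.",
    "Our bakery opens at nine every morning.",
    "Patch the vulnerability before attackers exploit it.",
]
kept = list(filter_corpus(sample_docs))
```

In a real setting, `score_fn` would be replaced by the classifier's predicted probability, and the filter would stream over the web corpus rather than an in-memory list.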

#### **Cybersecurity LLMs:**
- **Llama-Primus-Base:** Based on Llama-3.1-8B-Instruct, continually pretrained on 2.77B tokens of cybersecurity text (_Primus-Seed_ and _Primus-FineWeb_), achieving a 🚀15.88% improvement in the aggregated score across multiple cybersecurity benchmarks.  
- **Llama-Primus-Merged:** Maintains nearly the same instruction-following capability as Llama-3.1-8B-Instruct while achieving a 🚀14.84% improvement across multiple cybersecurity benchmarks.  
- **Llama-Primus-Reasoning:** The first cybersecurity reasoning model! Distilled from reasoning and reflection data generated by o1-preview on cybersecurity tasks (_Primus-Reasoning_), achieving a 🚀10% improvement on CISSP.  

We look forward to your feedback and hope this work can help advance the cybersecurity AI community! 🚀
