Hi,
We appreciate the author's dedicated work in compiling research on cybersecurity LLMs; it is highly beneficial to the entire field!
We hope you will consider incorporating our latest work 🤗. We have open-sourced the first large-scale cybersecurity pretraining dataset, along with specialized datasets for cybersecurity IFT, reasoning-with-reflection data for distillation, and the cybersecurity LLMs trained on them.
📄 Paper: https://arxiv.org/abs/2502.11191
🤗 Hugging Face: https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243
What We Do
Datasets (a minimal loading sketch follows the list):
- Primus-Seed: Manually collected high-quality cybersecurity texts, including MITRE, Wikipedia, blogs, books, etc.
- Primus-FineWeb: We trained a cybersecurity text classifier to filter 2.57B tokens of cybersecurity text from FineWeb (a refined version of Common Crawl).
- Primus-Instruct: Includes approximately 1K QA pairs covering common cybersecurity business scenarios.
- Primus-Reasoning: Includes reasoning and reflection data generated by o1-preview on cybersecurity tasks for distillation.
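To make the datasets easy to try, here is a minimal loading sketch with the Hugging Face `datasets` library. The repo ID and split name are assumptions inferred from the collection link above, not confirmed values; the dataset cards have the authoritative details:

```python
from datasets import load_dataset

# Repo ID is an assumption based on the collection name above;
# check the dataset cards for the exact IDs, configs, and splits.
seed = load_dataset("trendmicro-ailab/Primus-Seed", split="train")

# Inspect one record; the field names depend on the dataset card.
print(seed[0])
```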
Cybersecurity LLMs (an inference sketch follows the list):
- Llama-Primus-Base: Based on Llama-3.1-8B-Instruct, continually pretrained on 2.77B tokens of cybersecurity text (Primus-Seed and Primus-FineWeb), achieving a 🚀15.88% improvement in the aggregated score across multiple cybersecurity benchmarks.
- Llama-Primus-Merged: Maintains nearly the same instruction-following capability as Llama-3.1-8B-Instruct while achieving a 🚀14.84% improvement across multiple cybersecurity benchmarks.
- Llama-Primus-Reasoning: The first cybersecurity reasoning model! Distilled on reasoning-and-reflection data generated by o1-preview on cybersecurity tasks (Primus-Reasoning), achieving a 🚀10% improvement on CISSP.
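Since Llama-Primus-Merged keeps Llama-3.1-8B-Instruct's instruction-following behavior, it should work with the standard `transformers` chat workflow. A minimal sketch, assuming the repo ID follows the collection naming and that the model ships the base model's chat template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID is an assumption based on the collection naming above.
model_id = "trendmicro-ailab/Llama-Primus-Merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumes the model retains Llama-3.1's chat template, since it keeps
# the base model's instruction-following behavior.
messages = [{"role": "user", "content": "Summarize the MITRE ATT&CK tactic 'Lateral Movement'."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```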
We look forward to your feedback and hope this contributes to advancing the cybersecurity AI community! 🚀