AI-powered browser-based vulnerability scanner using UniXcoder embeddings and RAG with LLM to detect security flaws across 9 languages.

butlerem/vulnerability-scanner-UniXcoder-RAG

Sylint

Sylint is an AI-powered, browser-based static vulnerability scanner that combines deep code embeddings (UniXcoder) with retrieval-augmented generation (RAG) via LLMs to detect, explain, and debug security issues in 9 programming languages with greater precision than traditional static analyzers or basic LLM-based tools. It goes beyond conventional static analysis by understanding code semantically and grounding AI explanations in real-world vulnerability patterns.

Code Understanding with UniXcoder

Built on UniXcoder (microsoft/unixcoder-base) by Microsoft, a model trained across source code, comments, and ASTs.

Supports 9 languages: Python, JavaScript, Java, C, C++, PHP, Ruby, Go, and TypeScript.

Encodes code into 768-dimensional embeddings that capture logic, not just syntax.

Detects vulnerabilities even if code is obfuscated (e.g., renamed variables, reordered structures).

Vulnerability-Driven Dataset (CVEfixes)

Uses a filtered version of the CVEfixes dataset, focused solely on vulnerable samples from the NVD.

Associates each sample with language, CVE IDs, and CWE vulnerability classes.

Prioritizes matching real-world exploit patterns rather than safe or patched code.

Retrieval-Augmented Generation (RAG) for Explanations

On code submission, Sylint retrieves similar vulnerabilities based on embeddings.

These examples inform the LLM's explanation, grounding it in real-world evidence.

Results in more accurate, consistent, and trusted vulnerability analysis.
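
The retrieval step can be illustrated in plain Python as a cosine-similarity search. This is a minimal sketch, not Sylint's implementation: the project delegates this search to Pinecone, and the function names `cosine_similarity` and `top_k_similar`, the sample ids, and the toy 3-dimensional vectors (real UniXcoder embeddings have 768 dimensions) are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(query, corpus, k=3):
    """Return the k (id, score) pairs most similar to the query vector.

    `corpus` maps a sample id (hypothetical here) to its embedding.
    """
    scored = [(sid, cosine_similarity(query, vec)) for sid, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for 768-dimensional embeddings.
corpus = {
    "sql-injection-sample": [0.9, 0.1, 0.0],
    "xss-sample": [0.1, 0.9, 0.0],
    "path-traversal-sample": [0.0, 0.2, 0.9],
}
matches = top_k_similar([0.8, 0.2, 0.0], corpus, k=2)
```

Here `matches` holds the two closest samples, best first. A vector database performs the same ranking with an approximate index so that it scales beyond a handful of samples.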

Tech Stack

Frontend: Next.js (React + TypeScript) + Tailwind CSS

Backend:

  • Convex: realtime database and serverless backend
  • FastAPI: Python service for AI model interaction (embeddings + LLM calls)

Authentication: Clerk.dev

AI Models:

  • UniXcoder (microsoft/unixcoder-base): code embedding model
  • Groq API (llama-3.3-70b-versatile, Meta's Llama 3.3 70B): vulnerability explanation model

Vector Database:

  • Pinecone

Key Features

  • Monaco-based code editor with multi-language support
  • Syntax highlighting and language-aware defaults
  • Upload code and trigger vulnerability scans
  • Code embedding generation using UniXcoder
  • Similarity search against known vulnerable codebases
  • Retrieval-augmented vulnerability explanations
  • Save and view scan history (Convex backend)
  • Clerk authentication with Pro subscription gating
  • Webhook integration with Lemon Squeezy
  • Full HTTPS (SSL-secured) frontend/backend communication
  • Automatic CWE/CVE tagging based on LLM analysis
  • Exportable vulnerability reports (PDF/Markdown)

Getting Started

  1. Clone and set up the project:
     git clone https://github.com/your-username/sylint.git
     cd sylint
     npm install
  2. Run the frontend (Next.js):
     npm run dev
  3. Run the Convex backend:
     npx convex dev
  4. Run the FastAPI service (embeddings and explanations):
     cd ai-service
     python3 -m venv venv
     source venv/bin/activate
     pip install -r requirements.txt
     uvicorn main:app --reload --port 8000

Make sure your .env file contains:

  • GROQ_API_KEY
  • Clerk keys (frontend/backend)

AI Service Endpoints

  • POST /embed: generates a 768-dimension code embedding via UniXcoder
  • POST /explain: generates a vulnerability explanation via Groq (llama-3.3-70b-versatile)
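
As a rough sketch of how a client might call these two endpoints, the snippet below builds (but does not send) the HTTP requests using only the Python standard library. The JSON field names (`code`, `language`, `matches`) and the helper `build_request` are assumptions for illustration, since this README does not document the request schema. The base URL follows the `--port 8000` setting from Getting Started.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port taken from the uvicorn command


def build_request(path, payload):
    """Build a JSON POST request for the AI service (not sent here)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


snippet = "query = 'SELECT * FROM users WHERE id = ' + user_id"

# Hypothetical request bodies; the real schema may differ.
embed_req = build_request("/embed", {"code": snippet, "language": "python"})
explain_req = build_request("/explain", {"code": snippet, "matches": []})

# To actually send one: urllib.request.urlopen(embed_req)
```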

Vector Database Setup

The Pinecone index is used to:

  • Store embeddings of vulnerable code samples (CWE/CVE dataset)
  • Enable semantic similarity search on user-submitted code
  • Power retrieval-augmented LLM explanations

Sylint can detect vulnerabilities even when the submitted code looks different from the original vulnerable example.

LLM Integration Details

Model: Llama 3.3 70B Versatile (llama-3.3-70b-versatile) via Groq API

Responsibilities:

  • Deep vulnerability reasoning
  • CWE/CVE tagging
  • Auto-fix patch suggestions
  • Exportable full vulnerability reports

Sylint Notes on Architecture, RAG Usage, and Next Steps

Current Architecture (RAG-Based Vulnerability Scanner)

  • User submits source code (JavaScript, Python, etc.)
  • Code is embedded using UniXcoder
  • Query sent to Pinecone (vector DB) containing ~4,000 vulnerable code snippets
  • Retrieve top-k most semantically similar code examples
  • Forward original code + top-k results to the LLM (Groq-hosted Mixtral, Llama 3, etc.)
  • LLM returns:
    • Explanation of vulnerability
    • CWE classification
    • (Optionally) suggested fix

This approach is a form of Retrieval-Augmented Generation (RAG).
It uses semantic similarity to ground the LLM’s response in real-world examples, which helps it flag vulnerabilities that exact pattern matching would miss, such as obfuscated or loosely similar code.
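
The flow described in this section can be sketched as three pluggable steps. This is an illustration under stated assumptions, not the project's code: `build_prompt`, `scan`, the prompt wording, and the `fake_*` stand-ins are hypothetical, and the real service would call UniXcoder, Pinecone, and Groq in their place.

```python
def build_prompt(code, matches):
    """Combine the submitted code with retrieved examples into one prompt."""
    lines = ["You are a security reviewer. Known vulnerable examples:"]
    for i, match in enumerate(matches, start=1):
        lines.append(f"Example {i} ({match['cwe']}):")
        lines.append(match["code"])
    lines.append("Code under review:")
    lines.append(code)
    lines.append("Explain any vulnerability, give its CWE, and suggest a fix.")
    return "\n".join(lines)


def scan(code, embed, search, explain, top_k=5):
    """Run embed -> retrieve -> explain using caller-supplied functions."""
    vector = embed(code)
    matches = search(vector, top_k)
    return explain(build_prompt(code, matches))


# Stand-ins so the sketch runs without UniXcoder, Pinecone, or Groq.
def fake_embed(code):
    return [float(len(code))]


def fake_search(vector, top_k):
    return [{"cwe": "CWE-89", "code": "cursor.execute(q + uid)"}][:top_k]


def fake_explain(prompt):
    return f"(LLM would answer a {len(prompt.splitlines())}-line prompt)"


report = scan("db.execute('SELECT ' + name)", fake_embed, fake_search, fake_explain)
```

Passing the three steps in as functions keeps the orchestration testable without network access and makes it easy to swap the model or vector store later.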

Planned Improvements: Compliance-Aware RAG

Add a second Pinecone index (or namespace) for compliance rules:

  • Sources: NIST SP 800-53, PCI DSS v4.0, HIPAA, OWASP ASVS
  • Each rule is embedded as text (same model or a text-focused one)

Dual Retrieval on Submission:

  • Step 1: User submits code
  • Step 2: Embed the code
  • Step 3: Run two vector searches:
    1. Against vulnerable code examples
    2. Against compliance rule embeddings
  • Step 4: Combine both sets of results + user code and send to LLM

LLM returns:

  • Vulnerability summary
  • CWE mapping
  • Matched compliance rules (e.g., “Violates PCI 6.2.4”, “Conflicts with NIST SI-10”)
  • (Optional) Fix recommendation contextualized by regulation
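
A minimal sketch of the planned dual retrieval follows. It is only one possible shape for this step: the `search` callable, the `vulnerable_code` type value, the sample id, and the in-memory `RECORDS` list are assumptions for the example, while `compliance_rule` mirrors the `type` value used for compliance records elsewhere in this README.

```python
def dual_retrieve(vector, search, top_k=5):
    """Run one search per record type and return both result sets.

    `search(vector, top_k, metadata_filter)` is a caller-supplied function,
    for example a thin wrapper around a vector database query.
    """
    return {
        "vulnerabilities": search(vector, top_k, {"type": "vulnerable_code"}),
        "compliance_rules": search(vector, top_k, {"type": "compliance_rule"}),
    }


def format_context(results):
    """Render both result sets as a text block to place in the LLM prompt."""
    lines = ["Similar vulnerable code:"]
    lines += [f"- {m['id']} ({m['cwe']})" for m in results["vulnerabilities"]]
    lines.append("Relevant compliance rules:")
    lines += [f"- {m['id']} ({m['source']})" for m in results["compliance_rules"]]
    return "\n".join(lines)


# In-memory stand-in for the vector database (ignores the vector itself).
RECORDS = [
    {"type": "vulnerable_code", "id": "sample-17", "cwe": "CWE-79"},
    {"type": "compliance_rule", "id": "PCI 6.2.4", "source": "PCI", "cwe": "CWE-79"},
]


def fake_search(vector, top_k, metadata_filter):
    hits = [r for r in RECORDS if all(r.get(k) == v for k, v in metadata_filter.items())]
    return hits[:top_k]


context = format_context(dual_retrieve([0.1, 0.2], fake_search))
```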

Parameterized Retrieval

Users can optionally select:

  • A compliance mode (e.g., “PCI only”, “HIPAA only”)
  • A vulnerability scan scope (e.g., “Common CWEs only”, “Critical CWEs”)

This is handled by adding metadata to Pinecone records:

{
  "type": "compliance_rule",
  "source": "PCI",
  "id": "PCI 6.2.4",
  "cwe": "CWE-79"
}

Then filter retrieval using Pinecone’s metadata filtering:

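// Assumes `index` is a Pinecone index handle and `embeddedCode` is the
// 768-dimension embedding of the submitted code.
const results = await index.query({
  vector: embeddedCode,
  topK: 5,
  filter: { source: "PCI" }, // only records whose metadata has source = "PCI"
  includeMetadata: true, // return stored metadata with each match
});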

RAG still applies — you’re just narrowing retrieval to user-defined categories.
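
On the Python side, the user's selections could be translated into a filter before the query is issued. The sketch below is an assumption about how that might look: `build_filter` and its parameters are hypothetical, and the dictionaries use the `$eq` and `$in` operator style of Pinecone metadata filters, where multiple keys are combined with AND.

```python
def build_filter(sources=None, cwes=None, record_type=None):
    """Translate optional user selections into a metadata filter dict."""
    metadata_filter = {}
    if record_type:
        metadata_filter["type"] = {"$eq": record_type}
    if sources:
        metadata_filter["source"] = {"$in": list(sources)}
    if cwes:
        metadata_filter["cwe"] = {"$in": list(cwes)}
    return metadata_filter


# "PCI only" compliance mode.
pci_only = build_filter(sources=["PCI"], record_type="compliance_rule")

# A scan scope limited to a chosen set of CWEs.
critical = build_filter(cwes=["CWE-79", "CWE-89"])

# No selection: an empty dict, in which case the caller can omit the filter.
no_filter = build_filter()
```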

Static Analysis Tools Comparison

Examples:

  • Semgrep
  • Bandit
  • ESLint
  • SonarQube
  • Fortify

Strengths:

  • Fast and deterministic
  • Easy CI/CD integration
  • Rule-based, low compute cost
  • No hallucination

Limitations:

  • Rigid pattern matching (can’t generalize)
  • Can’t reason about complex logic or control flow
  • High false positive rate if rules aren’t tightly scoped
  • Weak cross-language/generalization ability

What Makes Sylint Unique

  • Uses semantic similarity instead of regex/patterns
  • Can detect non-exact, obfuscated, or stylistic vulnerabilities
  • Explanations in plain English (good for junior devs or audits)
  • Compliance-aware output, not just raw CVEs or rule flags
  • Supports multi-language input via embedding model
  • Easily extensible with new rules, frameworks, or CWE mappings

Future Expansion Ideas

1. LLM + Static Tool Hybrid

  • Run Semgrep (or other tools) first
  • Pass flagged lines + rule metadata into the LLM
    • Example prompt:
      "Semgrep flagged line 12 for CWE-89 (SQLi). Does this look valid? Why or why not?"

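The hybrid idea can be sketched by turning one static-analysis finding into a review prompt. The `Finding` class and `review_prompt` helper are hypothetical and deliberately simplified; they do not reflect the actual output format of Semgrep or any other tool.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One flagged location, reduced to the fields the prompt needs."""
    tool: str
    line: int
    cwe: str
    label: str


def review_prompt(finding, source_code):
    """Ask the LLM to confirm or reject a static-analysis finding."""
    flagged = source_code.splitlines()[finding.line - 1]  # lines are 1-indexed
    return (
        f"{finding.tool} flagged line {finding.line} for {finding.cwe} "
        f"({finding.label}). Does this look valid? Why or why not?\n"
        f"Flagged line: {flagged.strip()}"
    )


code = "def get_user(db, uid):\n    return db.execute('SELECT * FROM users WHERE id=' + uid)"
prompt = review_prompt(Finding("Semgrep", 2, "CWE-89", "SQLi"), code)
```

Including the flagged line in the prompt gives the LLM the exact statement to judge rather than only a line number.
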
2. Custom Rule Builder

  • Let users write their own rules in natural language or regex
  • Embed those rules into Pinecone with metadata
  • Future scans check user-defined rules as part of vector retrieval

3. Multi-LLM Routing

  • Use different LLMs for:
    • Code interpretation (e.g., Claude or CodeLlama)
    • Compliance explanation (e.g., GPT-4)
    • Fast response generation (e.g., Groq Mixtral)

4. Confidence Scoring

  • Report vector distance of matches (e.g., “match = 0.87 similarity”)
  • Optionally add LLM “certainty” estimates to guide reviewers
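
One simple way to present such scores is to map the similarity value to a coarse label. The thresholds (0.85 and 0.70) and the function names below are assumptions chosen for illustration; suitable cut-offs would need to be calibrated on real scan results.

```python
def confidence_label(similarity, high=0.85, medium=0.70):
    """Map a cosine similarity score to a coarse reviewer-facing label."""
    if similarity >= high:
        return "high"
    if similarity >= medium:
        return "medium"
    return "low"


def describe_match(sample_id, similarity):
    """Format one retrieved match with its score and label."""
    label = confidence_label(similarity)
    return f"{sample_id}: match = {similarity:.2f} similarity ({label})"


summary = describe_match("sample-17", 0.87)
```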

5. Interactive Fix Mode

  • Show diff between original and fixed version
  • Allow users to accept, modify, or reject changes
  • Could be followed by a re-scan to validate the fix
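
The diff view could be produced with Python's standard `difflib` module. This is a sketch only: `fix_diff`, the file label, and the example fix are hypothetical, and a browser UI would more likely render the diff in the Monaco editor.

```python
import difflib


def fix_diff(original, fixed, filename="snippet.py"):
    """Return a unified diff between the original and the suggested fix."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"{filename} (original)",
        tofile=f"{filename} (suggested fix)",
    )
    return "".join(diff)


original = "cur.execute('SELECT * FROM users WHERE id=' + uid)\n"
fixed = "cur.execute('SELECT * FROM users WHERE id=%s', (uid,))\n"
patch = fix_diff(original, fixed)
```

Lines removed by the fix are prefixed with `-` and added lines with `+`, which gives the user a clear basis for accepting, editing, or rejecting the change.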

6. Feedback Loop

  • Let users label results as “correct”, “false positive”, or “incomplete”
  • Fine-tune embedding weights, re-rank results, or improve LLM response strategy
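
As one possible reading of the re-ranking idea, user labels could down-weight matches that are often marked as false positives. The scoring rule below (similarity multiplied by one minus the false-positive rate) and all names in it are assumptions for illustration, not a method described by this project.

```python
from collections import Counter


def false_positive_rates(feedback):
    """Compute the share of 'false positive' labels per sample id.

    `feedback` is a list of (sample_id, label) pairs collected from users.
    """
    totals, false_positives = Counter(), Counter()
    for sample_id, label in feedback:
        totals[sample_id] += 1
        if label == "false positive":
            false_positives[sample_id] += 1
    return {sid: false_positives[sid] / totals[sid] for sid in totals}


def rerank(matches, feedback):
    """Down-weight matches that users often marked as false positives."""
    rates = false_positive_rates(feedback)
    adjusted = [(sid, score * (1.0 - rates.get(sid, 0.0))) for sid, score in matches]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)


feedback = [("a", "false positive"), ("a", "correct"), ("b", "correct")]
ranked = rerank([("a", 0.90), ("b", 0.80)], feedback)
```

In this toy run, sample `a` starts with the higher similarity but drops below `b` because half of its feedback labels are false positives.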

Key Concepts (For Reference)

| Term | Explanation |
| --- | --- |
| RAG (Retrieval-Augmented Generation) | Enhance LLM output by retrieving external context (e.g., code examples, rules) |
| Embedding | Vector representation of code or text |
| Vector DB (Pinecone) | Specialized database for fast semantic search |
| LLM | Large language model used to generate explanations |
| CWE | Common Weakness Enumeration: standardized vulnerability IDs |
| Parameterized Retrieval | User-driven filters applied to vector search (e.g., by standard, severity, CWE ID) |
| Compliance Frameworks | Regulatory standards like PCI, HIPAA, NIST that define what “secure” means |
| Static Analysis | Traditional method of finding code flaws without running code |

This is the current mental model and roadmap for building out Sylint as a semantically-aware, compliance-integrated vulnerability scanner.
