Sylint is an AI-powered, browser-based static vulnerability scanner that combines deep code embeddings (UniXcoder) with retrieval-augmented generation (RAG) via LLMs to detect, explain, and debug security issues in 9 programming languages, with greater precision than traditional static analyzers or basic LLM-based tools. It goes beyond conventional static analysis by understanding code semantically and grounding AI explanations in real-world vulnerability patterns.
Built on UniXcoder (microsoft/unixcoder-base) by Microsoft, a model trained across source code, comments, and ASTs.
Supports 9 languages: Python, JavaScript, Java, C, C++, PHP, Ruby, Go, and TypeScript.
Encodes code into 768-dimensional embeddings that capture logic, not just syntax.
Detects vulnerabilities even if code is obfuscated (e.g., renamed variables, reordered structures).
Uses a filtered version of the CVEfixes dataset, focused solely on vulnerable samples from the NVD.
Associates each sample with language, CVE IDs, and CWE vulnerability classes.
Prioritizes matching real-world exploit patterns rather than safe or patched code.
On code submission, Sylint retrieves similar vulnerabilities based on embeddings.
These examples inform the LLM's explanation, grounding it in real-world evidence.
Results in more accurate, consistent, and trusted vulnerability analysis.
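The retrieval step can be illustrated with a minimal in-memory sketch. It assumes cosine similarity as the comparison metric and uses toy 3-dimensional vectors in place of the real 768-dimensional embeddings; the `Sample` type, its field names, and the CVE IDs are placeholders invented for illustration, not Sylint's actual schema or data.

```python
import math
from typing import NamedTuple


class Sample(NamedTuple):
    """One indexed vulnerable snippet (field names are illustrative)."""
    cve_id: str
    cwe_id: str
    language: str
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], samples: list[Sample], k: int = 3) -> list[tuple[float, Sample]]:
    """Return the k samples most similar to the query, best match first."""
    scored = [(cosine_similarity(query, s.embedding), s) for s in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Placeholder corpus: the CVE IDs below are made up for the example.
corpus = [
    Sample("CVE-0000-0001", "CWE-89", "python", [1.0, 0.0, 0.0]),
    Sample("CVE-0000-0002", "CWE-79", "javascript", [0.0, 1.0, 0.0]),
    Sample("CVE-0000-0003", "CWE-89", "php", [0.9, 0.1, 0.0]),
]
matches = top_k([1.0, 0.05, 0.0], corpus, k=2)
for score, sample in matches:
    print(f"{sample.cve_id} {sample.cwe_id} similarity={score:.3f}")
```

In the deployed system this ranking is delegated to the vector database; the sketch only shows what "similar" means at the vector level.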
Frontend: Next.js (React + TypeScript) + Tailwind CSS
Backend:
- Convex — Realtime database and serverless backend
- FastAPI — Python service for AI model interaction (embeddings + LLM calls)
Authentication: Clerk.dev
AI Models:
- UniXcoder (microsoft/unixcoder-base) — Code embedding model
- Groq API (llama-3.3-70b-versatile) — Vulnerability explanation model
Vector Database:
- Pinecone
- Monaco-based code editor with multi-language support
- Syntax highlighting and language-aware defaults
- Upload code and trigger vulnerability scans
- Code embedding generation using UniXcoder
- Similarity search against known vulnerable codebases
- Retrieval-augmented vulnerability explanations
- Save and view scan history (Convex backend)
- Clerk authentication with Pro subscription gating
- Webhook integration with Lemon Squeezy
- Full HTTPS (SSL-secured) frontend/backend communication
- Automatic CWE/CVE tagging based on LLM analysis
- Exportable vulnerability reports (PDF/Markdown)
- Clone and set up the project
git clone https://github.com/your-username/sylint.git
cd sylint
npm install
- Run Frontend (Next.js):
npm run dev
- Run Convex Backend:
npx convex dev
- Run FastAPI service (embeddings and explanations):
cd ai-service
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload --port 8000
Make sure your .env file contains:
- GROQ_API_KEY
- Clerk keys (frontend/backend)
- POST /embed — Generate 768-dimensional code embedding via UniXcoder
- POST /explain — Generate vulnerability explanation via Groq (llama-3.3-70b-versatile)
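As a rough sketch of calling these endpoints from Python, the snippet below builds a `POST /embed` request with the standard library. The request body fields (`code`, `language`) and the JSON response are assumptions made for illustration; the actual schema is defined by the FastAPI service. The base URL assumes the local uvicorn port from the setup steps.

```python
import json
import urllib.request

# Assumed local address, matching the uvicorn port used during setup.
AI_SERVICE_URL = "http://localhost:8000"


def build_embed_request(code: str, language: str) -> urllib.request.Request:
    """Build (but do not send) a POST /embed request.

    The `code` and `language` body fields are hypothetical.
    """
    payload = json.dumps({"code": code, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=f"{AI_SERVICE_URL}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(code: str, language: str):
    """Send the request and decode the JSON response (shape is assumed)."""
    with urllib.request.urlopen(build_embed_request(code, language)) as resp:
        return json.load(resp)


request = build_embed_request("print(input())", "python")
print(request.get_method(), request.full_url)
```

Calling `embed(...)` requires the FastAPI service to be running; building the request does not.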
Store embeddings of vulnerable code samples (CWE/CVE dataset).
Enable semantic similarity search on user-submitted code.
Power retrieval-augmented LLM explanations.
Sylint can detect vulnerabilities even when the submitted code looks different from the original vulnerable example.
Model: Llama 3.3 70B Versatile (llama-3.3-70b-versatile) via Groq API
Responsibilities:
- Deep vulnerability reasoning
- CWE/CVE tagging
- Auto-fix patch suggestions
- Exportable full vulnerability reports
- User submits source code (JavaScript, Python, etc.)
- Code is embedded using UniXcoder
- Query sent to Pinecone (vector DB) containing ~4,000 vulnerable code snippets
- Retrieve top-k most semantically similar code examples
- Forward original code + top-k results to LLM (Groq’s Mixtral, llama-3, etc.)
- LLM returns:
  - Explanation of vulnerability
  - CWE classification
  - (Optionally) suggested fix
This approach is a form of Retrieval-Augmented Generation (RAG).
It uses semantic similarity to ground the LLM’s response in real-world examples, which enables detection of vulnerabilities that are obfuscated or that do not match any fixed pattern.
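The step that forwards the original code and the top-k results to the LLM amounts to prompt assembly. A minimal sketch follows; the `RetrievedExample` type, the `build_prompt` helper, and the prompt wording are all hypothetical and not taken from Sylint's code.

```python
from dataclasses import dataclass


@dataclass
class RetrievedExample:
    """A match returned by the vector search (fields are illustrative)."""
    code: str
    cwe_id: str
    cve_id: str
    similarity: float


def build_prompt(user_code: str, language: str, examples: list[RetrievedExample]) -> str:
    """Combine the submitted code and retrieved examples into one prompt."""
    lines = [
        "You are a security reviewer. Analyze the submitted code for vulnerabilities.",
        "Use the retrieved examples of known-vulnerable code as evidence.",
        "",
        f"Submitted code ({language}):",
        user_code,
        "",
        "Retrieved examples:",
    ]
    for i, ex in enumerate(examples, start=1):
        lines.append(f"[{i}] {ex.cve_id} / {ex.cwe_id} (similarity {ex.similarity:.2f})")
        lines.append(ex.code)
        lines.append("")
    lines.append(
        "Respond with: an explanation, a CWE classification, and optionally a suggested fix."
    )
    return "\n".join(lines)


# Placeholder match; the CVE ID is made up for the example.
example = RetrievedExample(
    code='cursor.execute("SELECT * FROM t WHERE id = " + uid)',
    cwe_id="CWE-89",
    cve_id="CVE-0000-0001",
    similarity=0.91,
)
prompt = build_prompt('db.run("DELETE FROM t WHERE id = " + req_id)', "python", [example])
print(prompt)
```

Including the similarity score next to each example gives the model a hint about how much weight each match deserves.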
Add a second Pinecone index (or namespace) for compliance rules:
- Sources: NIST SP 800-53, PCI DSS v4.0, HIPAA, OWASP ASVS
- Each rule is embedded as text (same model or a text-focused one)
- Step 1: User submits code
- Step 2: Embed the code
- Step 3: Run two vector searches:
  - Against vulnerable code examples
  - Against compliance rule embeddings
- Step 4: Combine both sets of results + user code and send to LLM
LLM returns:
- Vulnerability summary
- CWE mapping
- Matched compliance rules (e.g., “Violates PCI 6.2.4”, “Conflicts with NIST SI-10”)
- (Optional) Fix recommendation contextualized by regulation
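The two-search flow can be sketched with the search backend injected as a function, so the orchestration is independent of any particular vector store. The `"compliance_rule"` type value appears in the metadata example in this document; the `"vulnerable_code"` value, the `retrieve_context` helper, and the stub records are assumptions. A deployment with two separate indexes or namespaces would select the index instead of filtering on `type`.

```python
from typing import Callable

# A search function takes (embedding, top_k, metadata_filter) and returns matches.
SearchFn = Callable[[list[float], int, dict], list[dict]]


def retrieve_context(embedding: list[float], search: SearchFn, top_k: int = 5) -> dict:
    """Run both searches and bundle the results for the LLM prompt."""
    vulnerable = search(embedding, top_k, {"type": "vulnerable_code"})
    rules = search(embedding, top_k, {"type": "compliance_rule"})
    return {"vulnerable_examples": vulnerable, "compliance_rules": rules}


def fake_search(embedding: list[float], top_k: int, metadata_filter: dict) -> list[dict]:
    """Stub backend for the example: ignores the vector, filters by type only."""
    records = [
        {"id": "PCI 6.2.4", "type": "compliance_rule", "source": "PCI"},
        {"id": "example-1", "type": "vulnerable_code", "cwe": "CWE-89"},
    ]
    return [r for r in records if r["type"] == metadata_filter["type"]][:top_k]


context = retrieve_context([0.1, 0.2, 0.3], fake_search)
print(context)
```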
Users can optionally select:
- A compliance mode (e.g., “PCI only”, “HIPAA only”)
- A vulnerability scan scope (e.g., “Common CWEs only”, “Critical CWEs”)
This is handled by adding metadata to Pinecone records:
{
  "type": "compliance_rule",
  "source": "PCI",
  "id": "PCI 6.2.4",
  "cwe": "CWE-79"
}
Then filter retrieval using Pinecone’s metadata filtering:
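// Assumes `index` is a handle to the Pinecone index that holds these records
const results = await index.query({
  vector: embeddedCode,
  topK: 5,
  filter: { source: "PCI" },
  includeMetadata: true,
});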
RAG still applies — you’re just narrowing retrieval to user-defined categories.
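One way to turn the user's selections into a retrieval filter is a small helper that emits a Pinecone-style metadata filter using the `$eq` and `$in` operators. The `build_filter` function and its parameter names are hypothetical; the `source` and `cwe` keys follow the metadata example above.

```python
from typing import Iterable, Optional


def build_filter(
    compliance_mode: Optional[str] = None,
    cwe_ids: Optional[Iterable[str]] = None,
) -> dict:
    """Translate user selections into a metadata filter.

    Keys at the top level of the filter are combined with a logical AND.
    An empty dict means "no filtering".
    """
    clauses: dict = {}
    if compliance_mode is not None:
        clauses["source"] = {"$eq": compliance_mode}
    if cwe_ids:
        clauses["cwe"] = {"$in": list(cwe_ids)}
    return clauses


print(build_filter("PCI"))
print(build_filter("PCI", ["CWE-79", "CWE-89"]))
```

When the result is empty, the caller would simply omit the filter from the query.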
- Semgrep
- Bandit
- ESLint
- SonarQube
- Fortify
- Fast and deterministic
- Easy CI/CD integration
- Rule-based, low compute cost
- No hallucination
- Rigid pattern matching (can’t generalize)
- Can’t reason about complex logic or control flow
- High false positive rate if rules aren’t tightly scoped
- Weak cross-language/generalization ability
- Uses semantic similarity instead of regex/patterns
- Can detect non-exact, obfuscated, or stylistic vulnerabilities
- Explanations in plain English (good for junior devs or audits)
- Compliance-aware output, not just raw CVEs or rule flags
- Supports multi-language input via embedding model
- Easily extensible with new rules, frameworks, or CWE mappings
- Run Semgrep (or other tools) first
- Pass flagged lines + rule metadata into the LLM
- Example prompt: "Semgrep flagged line 12 for CWE-89 (SQLi). Does this look valid? Why or why not?"
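A prompt like this could be generated from a static-analysis finding. The sketch below assumes a finding shaped like an entry in the `results` array of Semgrep's JSON output (`check_id`, `path`, `start.line`, `extra.message`, `extra.metadata.cwe`); the `prompt_from_finding` helper, the rule ID, and the sample data are hypothetical.

```python
def prompt_from_finding(finding: dict, source_lines: list[str], context: int = 2) -> str:
    """Build an LLM validation prompt from one static-analysis finding."""
    line_no = finding["start"]["line"]
    extra = finding.get("extra", {})
    cwe = extra.get("metadata", {}).get("cwe", "an unspecified CWE")
    if isinstance(cwe, list):  # the CWE field may be a list or a single string
        cwe = ", ".join(cwe)
    # Include a few lines around the flagged line, clamped to the file bounds.
    first = max(1, line_no - context)
    last = min(len(source_lines), line_no + context)
    snippet = "\n".join(f"{n}: {source_lines[n - 1]}" for n in range(first, last + 1))
    return (
        f"Semgrep rule {finding['check_id']} flagged line {line_no} "
        f"of {finding['path']} for {cwe}.\n"
        f"Message: {extra.get('message', '')}\n\n"
        f"{snippet}\n\n"
        "Does this look valid? Why or why not?"
    )


# Made-up finding and source file for the example.
sample_finding = {
    "check_id": "example.sql-injection",
    "path": "app.py",
    "start": {"line": 3},
    "extra": {"message": "User input in SQL string", "metadata": {"cwe": ["CWE-89"]}},
}
sample_source = [
    "def get_user(cursor, user_id):",
    '    query = "SELECT * FROM users WHERE id = " + user_id',
    "    cursor.execute(query)",
    "    return cursor.fetchone()",
]
print(prompt_from_finding(sample_finding, sample_source, context=1))
```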
- Let users write their own rules in natural language or regex
- Embed those rules into Pinecone with metadata
- Future scans check user-defined rules as part of vector retrieval
- Use different LLMs for:
  - Code interpretation (e.g., Claude or CodeLlama)
  - Compliance explanation (e.g., GPT-4)
  - Fast response generation (e.g., Groq Mixtral)
- Report the vector similarity score of each match (e.g., “similarity = 0.87”)
- Optionally add LLM “certainty” estimates to guide reviewers
- Show diff between original and fixed version
- Allow users to accept, modify, or reject changes
- Could be followed by a re-scan to validate the fix
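Showing the difference between the original and the suggested fix can be done with Python's standard `difflib` module. This is only a sketch of the diff step; the `fix_diff` helper name and the file labels are invented for the example.

```python
import difflib


def fix_diff(original: str, fixed: str, filename: str = "submitted_code") -> str:
    """Return a unified diff between the original code and the suggested fix."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"{filename} (original)",
        tofile=f"{filename} (suggested fix)",
    )
    return "".join(diff)


original_code = (
    'query = "SELECT * FROM users WHERE id = " + user_id\n'
    "cursor.execute(query)\n"
)
fixed_code = (
    'query = "SELECT * FROM users WHERE id = %s"\n'
    "cursor.execute(query, (user_id,))\n"
)
print(fix_diff(original_code, fixed_code))
```

An empty diff means the suggested fix changed nothing, which is a useful sanity check before re-scanning.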
- Let users label results as “correct”, “false positive”, or “incomplete”
- Use this feedback to fine-tune embedding weights, re-rank results, or improve the LLM response strategy
Term | Explanation |
---|---|
RAG (Retrieval-Augmented Generation) | Enhance LLM output by retrieving external context (e.g., code examples, rules) |
Embedding | Vector representation of code or text |
Vector DB (Pinecone) | Specialized database for fast semantic search |
LLM | Large language model used to generate explanations |
CWE | Common Weakness Enumeration: standardized vulnerability IDs |
Parameterized Retrieval | User-driven filters applied to vector search (e.g., by standard, severity, CWE ID) |
Compliance Frameworks | Regulatory standards like PCI, HIPAA, NIST that define what “secure” means |
Static Analysis | Traditional method of finding code flaws without running code |
This is the current mental model and roadmap for building out Sylint as a semantically-aware, compliance-integrated vulnerability scanner.