"The smuggle is real!"
A comprehensive proof-of-concept demonstrating vector-based data exfiltration techniques in AI/ML environments. This project illustrates potential risks in RAG systems and provides tools and concepts for defensive analysis.
VectorSmuggle demonstrates techniques for covert data exfiltration through vector embeddings, showcasing how sensitive information can be hidden within seemingly legitimate RAG operations. This research tool helps security professionals understand and defend against attack vectors in AI/ML systems.
- 🎭 Steganographic Techniques: Embedding obfuscation and data hiding
- 📄 Multi-Format Support: Process 15+ document formats (PDF, Office, email, databases)
- 🕵️ Evasion Capabilities: Behavioral camouflage and detection avoidance
- 🔍 Enhanced Query Engine: Data reconstruction and analysis
- 🐳 Production-Ready: Full containerization and Kubernetes deployment
- 📊 Analysis Tools: Comprehensive forensic and risk assessment capabilities
graph TB
A[Document Sources] --> B[Multi-Format Loaders]
B --> C[Content Preprocessors]
C --> D[Steganography Engine]
D --> E[Evasion Layer]
E --> F[Vector Stores]
F --> G[Enhanced Query Engine]
G --> H[Analysis & Recovery Tools]
subgraph "Core Modules"
B
C
D
E
G
H
end
subgraph "External Services"
F
I[OpenAI API]
J[Monitoring Systems]
end
- Python 3.11+
- OpenAI API key (or Ollama with nomic-embed-text:latest as fallback)
- Docker (optional)
- Kubernetes cluster (optional)
# Clone repository
git clone https://github.com/jaschadub/VectorSmuggle.git
cd VectorSmuggle
# Set up virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your API keys and settings
# Embed documents with steganographic techniques
cd scripts
python embed.py --files ../sample_docs/*.pdf --techniques noise,rotation,fragmentation
# Query and reconstruct data
python query.py --mode recovery --export results.json
# Generate risk assessment
cd ../analysis
python risk_assessment.py
For a comprehensive demonstration of VectorSmuggle's capabilities, try our interactive quickstart demo:
# Run the complete workflow demonstration
cd examples
python quickstart_demo.py
# With deterministic results
python quickstart_demo.py --seed 42
# Test specific techniques
python quickstart_demo.py --techniques noise rotation fragmentation
The quickstart demo demonstrates:
- ✅ End-to-end workflow: Document loading → Steganographic embedding → Vector storage → Query reconstruction
- ✅ Multiple techniques: Noise injection, rotation, scaling, and fragmentation across models
- ✅ Real sample data: Processes documents from
sample_docs/
(financial, HR, technical files) - ✅ Integrity verification: Validates successful encoding and decoding of hidden data
- ✅ Performance metrics: Shows processing times, success rates, and data statistics
Expected runtime: 10-30 seconds | Sample output: 6 documents → 45 chunks → 45 steganographic embeddings
See examples/README.md
for detailed setup instructions, troubleshooting, and expected outputs.
- 📖 Research Methodology - Research approach and validation
- ⚔️ Attack Vectors - Comprehensive attack analysis
- 🛡️ Defense Strategies - Countermeasures and detection
- ⚖️ Compliance Impact - Regulatory implications
- 🏗️ System Architecture - Design and components
- 📋 API Reference - Module documentation
- ⚙️ Configuration Guide - Setup and options
- 🔧 Troubleshooting - Common issues
- 🚀 Quick Start Guide - Getting started
- 🎯 Advanced Usage - Complex scenarios
- 🔒 Security Testing - Testing procedures
- 🚢 Deployment Guide - Production deployment
Advanced techniques for hiding data within vector embeddings:
from steganography import EmbeddingObfuscator, MultiModelFragmenter
# Apply noise-based steganography
obfuscator = EmbeddingObfuscator(noise_level=0.01)
hidden_embeddings = obfuscator.obfuscate(embeddings, techniques=["noise", "rotation"])
# Fragment across multiple models
fragmenter = MultiModelFragmenter()
fragments = fragmenter.fragment_and_embed(sensitive_data)
Support for diverse document types:
from loaders import DocumentLoaderFactory
factory = DocumentLoaderFactory()
documents = factory.load_documents([
"financial_report.pdf",
"employee_data.xlsx",
"emails.mbox",
"database_export.sqlite"
])
Sophisticated detection avoidance:
from evasion import BehavioralCamouflage, TrafficMimicry
# Simulate legitimate user behavior
camouflage = BehavioralCamouflage(legitimate_ratio=0.8)
camouflage.generate_cover_story("data analysis project")
# Mimic normal traffic patterns
mimicry = TrafficMimicry(base_interval=300.0)
await mimicry.execute_with_timing(upload_operation)
Advanced data reconstruction:
from query import AdvancedQueryEngine, DataRecoveryTools
engine = AdvancedQueryEngine(vector_store, llm, embeddings)
recovery = DataRecoveryTools(embeddings)
# Multi-strategy search and reconstruction
results = engine.multi_strategy_search("sensitive financial data")
reconstructed = recovery.recover_data(results)
Comprehensive security risk evaluation:
from analysis.risk_assessment import VectorExfiltrationRiskAssessor
assessor = VectorExfiltrationRiskAssessor()
assessment = assessor.perform_comprehensive_assessment(
documents, embeddings, config
)
print(f"Risk Level: {assessment.overall_risk_level}")
Digital forensics for incident investigation:
from analysis.forensic_tools import EvidenceCollector, TimelineReconstructor
collector = EvidenceCollector()
evidence = collector.collect_vector_store_evidence(vector_data)
reconstructor = TimelineReconstructor()
timeline = reconstructor.reconstruct_timeline(evidence)
Generate security detection rules:
from analysis.detection_signatures import StatisticalSignatureGenerator
generator = StatisticalSignatureGenerator()
generator.establish_baseline(clean_embeddings)
signatures = generator.generate_statistical_signatures()
Create legitimate traffic patterns:
from analysis.baseline_generator import BaselineDatasetGenerator
generator = BaselineDatasetGenerator()
dataset = generator.generate_baseline_dataset(
num_users=50, days=7
)
# Development environment
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up -d
# Production environment
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
# Deploy to Kubernetes
kubectl apply -f k8s/ -n vectorsmuggle
# Check deployment status
kubectl get pods -n vectorsmuggle
kubectl rollout status deployment/vectorsmuggle -n vectorsmuggle
# Full deployment with monitoring
./scripts/deploy/deploy.sh --environment production --platform kubernetes --build
# Health check and validation
./scripts/deploy/health-check.sh --detailed --export health-report.json
# Core settings
OPENAI_API_KEY=sk-...
VECTOR_DB=qdrant
CHUNK_SIZE=512
# Embedding fallback settings
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_EMBEDDING_MODEL=nomic-embed-text:latest
# Steganography settings
STEGO_ENABLED=true
STEGO_TECHNIQUES=noise,rotation,fragmentation
STEGO_NOISE_LEVEL=0.01
# Evasion settings
EVASION_TRAFFIC_MIMICRY=true
EVASION_BEHAVIORAL_CAMOUFLAGE=true
EVASION_LEGITIMATE_RATIO=0.8
# Query settings
QUERY_CACHE_ENABLED=true
QUERY_MULTI_STEP_REASONING=true
QUERY_CONTEXT_RECONSTRUCTION=true
VectorSmuggle includes automatic fallback support for embedding models:
- Primary: OpenAI embeddings (requires API key)
- Fallback: Ollama with nomic-embed-text:latest (local)
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the embedding model
ollama pull nomic-embed-text:latest
# Start Ollama service
ollama serve
The system will automatically detect and use the available embedding provider.
See docs/technical/configuration.md
for comprehensive configuration options.
# Run linting and security checks
ruff check .
bandit -r . -f json
# Type checking
mypy .
This project represents a proof-of-concept implementation designed for educational and research purposes. The techniques demonstrated require rigorous experimental validation before any performance claims can be substantiated. (IN-PROGRESS)
Metrics Definition:
- Capacity: Measured in bits per embedding dimension with statistical significance testing
- Detection Resistance: Defined using specific metrics (ROC-AUC, F1-score) with confidence intervals
- Fidelity: Cosine similarity preservation with variance analysis
- Covert Data Exfiltration: Embedding systems can leak sensitive data without detection
- DLP Bypass: Traditional Data Loss Prevention tools cannot detect semantic leaks via vectors
- Insider Threats: Malicious actors can pose as legitimate LLM/RAG engineers
- External Storage: Sensitive data stored in third-party vector databases
- Steganographic Hiding: Data concealed within legitimate-looking embeddings
- Behavioral Camouflage: Attack activities disguised as normal user behavior
- Egress Monitoring: Monitor outbound connections to vector databases
- Embedding Analysis: Statistical analysis of vector spaces for anomalies
- Behavioral Detection: User activity pattern analysis
- Content Sanitization: Remove sensitive information before embedding
- Access Controls: Strict permissions and authentication requirements
- Audit Logging: Comprehensive logging of all embedding operations
- Red team exercises and attack simulations
- Blue team defense strategy development
- Security awareness training programs
- Incident response scenario planning
- Academic security research projects
- Vulnerability assessment methodologies
- Defense mechanism development
- Threat modeling frameworks
- Regulatory compliance validation
- Data protection impact assessments
- Security control effectiveness testing
- Risk assessment procedures
We welcome contributions from the security research community:
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature
) - Commit your changes (
git commit -m 'Add amazing feature'
) - Push to the branch (
git push origin feature/amazing-feature
) - Open a Pull Request
- Follow the existing code style and conventions
- Add comprehensive tests for new features
- Update documentation for any changes
- Ensure all security checks pass
- Include educational value in contributions
This project is licensed under the MIT License - see the LICENSE file for details.
IMPORTANT: This repository and its contents are intended for educational and ethical security research purposes only.
- Any actions or activities related to this material are solely your responsibility
- Misuse of these tools or techniques to access unauthorized data is illegal and unethical
- The authors assume no liability for any misuse or damage caused by this material
- Always obtain proper authorization before performing any security testing
For questions, suggestions, or responsible disclosure of security issues:
- General Questions: Open an issue on GitHub
- Research Collaboration: Contact the maintainer
If you use VectorSmuggle in your research, please cite it as follows:
@software{vectorsmuggle2025,
title={VectorSmuggle: A Comprehensive Framework for Vector-Based Data Exfiltration Research},
author={Wanger, Jascha},
organization={Tarnover, LLC},
year={2025},
url={https://github.com/jaschadub/VectorSmuggle},
note={Educational security research framework for AI/ML systems}
}
Wanger, J. (2025). VectorSmuggle: A Comprehensive Framework for Vector-Based Data Exfiltration Research [Computer software]. Tarnover, LLC. https://github.com/jaschadub/VectorSmuggle
J. Wanger, "VectorSmuggle: A Comprehensive Framework for Vector-Based Data Exfiltration Research," Tarnover, LLC, 2025. [Online]. Available: https://github.com/jaschadub/VectorSmuggle
Remember: This tool is designed to help improve security through education and research. Use responsibly and ethically.
Portions of this code are generated, tested, and audited using advanced AI models. ThirdKey can help secure your AI infrastructure.