matiasrodlo/veritas

Veritas: A Scientist for Autonomous Research

One of the grand challenges of artificial intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have been used to assist human scientists—for example, in brainstorming ideas or writing code—they still require extensive manual supervision or are constrained to narrow, task-specific use cases.

Veritas is a comprehensive system for fully automatic scientific discovery, enabling Foundation Models such as Large Language Models (LLMs) to perform research independently. It runs locally on Mistral 7B, ensuring full data privacy, minimizing citation hallucinations through Retrieval-Augmented Generation (RAG), and supporting customizable scientific writing styles via QLoRA (Quantized Low-Rank Adaptation). Veritas also integrates LongLoRA for context extension, allowing input windows of over 100,000 tokens to support long-form research workflows.

Veritas was developed during Major League Hacking’s Global Hack Week: Open Source (May 9–15, 2025). At the end of the week, MLH ranked me among the top 1% of participating hackers. I am convinced that in the coming years, tools like Veritas will evolve significantly and drive a paradigm shift, playing a leading role in the production of scientific knowledge.

Research Outputs

Example outputs are machine learning research papers across a range of emerging topics, including diffusion modeling, language generation, and grokking dynamics.

Note: While all core modules of Veritas have been validated, a production-grade RAG pipeline is still under development. The papers currently showcased were generated by the AI Scientist.

Research Workflow

Veritas mirrors the architecture of The AI Scientist (GPT-4) and implements the full research pipeline:

AI Scientist Architecture

1. Idea Generation

  • Receives a topic template
  • Brainstorms novel research directions
  • Validates novelty using Semantic Scholar

2. Experimental Iteration

  • Executes code for proposed methods
  • Collects outputs and visualizations
  • Annotates each result for interpretation

3. Paper Write-up

  • Generates a LaTeX-formatted scientific paper
  • Autonomously sources relevant citations

4. Automated Peer Review

  • Uses a custom LLM reviewer aligned with ML conference standards
  • Evaluates novelty, clarity, rigor
  • Feeds back into the system for future iterations
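
The four stages above form a loop in which reviewer feedback informs the next round of idea generation. The following is a minimal sketch of that control flow, not the Veritas implementation: `Review`, `RunState`, `run_pipeline`, and the stage callables are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    score: float   # overall rating assigned by the reviewer
    feedback: str  # free-text comments carried into the next round


@dataclass
class RunState:
    topic: str
    feedback_history: list = field(default_factory=list)
    papers: list = field(default_factory=list)


def run_pipeline(state, generate_idea, run_experiment, write_paper, review, rounds=2):
    """Run idea -> experiment -> write-up -> review, carrying feedback forward."""
    for _ in range(rounds):
        # Stage 1: earlier feedback is visible to idea generation
        idea = generate_idea(state.topic, state.feedback_history)
        # Stage 2: execute the proposed method and collect results
        results = run_experiment(idea)
        # Stage 3: turn the idea and results into a paper
        paper = write_paper(idea, results)
        # Stage 4: review the paper and store the verdict for the next round
        verdict = review(paper)
        state.papers.append((paper, verdict.score))
        state.feedback_history.append(verdict.feedback)
    return state
```

Passing the stages in as callables keeps the loop independent of any particular model, so each stage can be stubbed out when testing the orchestration.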

System Requirements

  • Apple Silicon Mac (M1, M2, M3, or M4)
  • macOS Monterey or later
  • 16GB RAM minimum (32GB+ recommended, 128GB optimal for M4)
  • 8GB+ free storage (SSD recommended), plus 13GB+ if you download the Mistral model
  • Python 3.9 or higher
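
A quick way to check a machine against the operating system, architecture, and Python requirements above is sketched below. `check_requirements` is a hypothetical helper, not part of Veritas, and it does not check RAM or free storage. It also assumes Python runs natively: under Rosetta, `platform.machine()` reports `x86_64` even on Apple Silicon.

```python
import platform
import sys


def check_requirements(system, machine, mac_version, python_version):
    """Return a list of unmet requirements (empty if the host qualifies)."""
    problems = []
    if system != "Darwin" or machine != "arm64":
        problems.append("Apple Silicon Mac required")
    elif int((mac_version or "0").split(".")[0]) < 12:
        # macOS Monterey is version 12
        problems.append("macOS Monterey (12) or later required")
    if tuple(python_version[:2]) < (3, 9):
        problems.append("Python 3.9 or higher required")
    return problems


if __name__ == "__main__":
    issues = check_requirements(
        platform.system(), platform.machine(), platform.mac_ver()[0], sys.version_info
    )
    print("OK" if not issues else "\n".join(issues))
```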

Installation

We provide a unified installation script that handles everything for you:

# Clone the repository
git clone https://github.com/matiasrodlo/veritas.git
cd veritas

# Basic installation (using convenience script)
./install.sh

# Or directly from tools directory
python tools/install.py

# To also download the Mistral model (optional, 13GB+)
python tools/install.py --download-model

# More installation options
python tools/install.py --upgrade                 # Upgrade existing dependencies
python tools/install.py --ignore-errors           # Continue even if some steps fail
python tools/install.py --skip-dependencies       # Skip installing dependencies
python tools/install.py --model "mistralai/Mistral-7B-v0.2"  # Specify model to download

The installation script:

  1. Creates necessary directories
  2. Installs all dependencies for both RAG and AI Scientist
  3. Sets up the package for development
  4. Creates basic research templates for AI Scientist
  5. Optionally downloads the Mistral model
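
To make the relationship between the flags and the steps concrete, here is an illustrative sketch of how such a command line could be parsed and turned into a step plan. It is not the contents of `tools/install.py`: `build_parser` and `plan_steps` are hypothetical, the default value for `--model` is an assumption, and so is the choice to download only when `--download-model` is given.

```python
import argparse


def build_parser():
    """Parser covering the installer flags documented above."""
    parser = argparse.ArgumentParser(description="Veritas installer (illustrative sketch)")
    parser.add_argument("--download-model", action="store_true",
                        help="also download the Mistral model")
    parser.add_argument("--upgrade", action="store_true",
                        help="upgrade existing dependencies")
    parser.add_argument("--ignore-errors", action="store_true",
                        help="continue even if some steps fail")
    parser.add_argument("--skip-dependencies", action="store_true",
                        help="skip installing dependencies")
    # Assumed default; the real script may use a different one
    parser.add_argument("--model", default="mistralai/Mistral-7B-v0.2",
                        help="model to download")
    return parser


def plan_steps(args):
    """Translate parsed flags into the ordered steps listed above."""
    steps = ["create directories"]
    if not args.skip_dependencies:
        steps.append("upgrade dependencies" if args.upgrade else "install dependencies")
    steps += ["install package in development mode", "create research templates"]
    if args.download_model:
        steps.append(f"download {args.model}")
    return steps
```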

After installation, you can use the command-line tools:

# Use main interface
veritas

# Use AI Scientist directly
veritas-ai-scientist

# See all available options
veritas --help

Manual Installation

If you prefer manual installation:

  1. Clone the repository:

    git clone https://github.com/matiasrodlo/veritas.git
    cd veritas
  2. Install dependencies:

    pip install -r requirements.txt
  3. Install the package:

    pip install -e .
  4. Download and prepare the Mistral model (if needed):

    mkdir -p models/mistral
    python -c "from huggingface_hub import snapshot_download; snapshot_download('mistralai/Mistral-7B-v0.2', local_dir='models/mistral')"

Quick Start

Run the unified terminal interface:

# Start with RAG system (default)
python scripts/run.py

# Start with AI Scientist
python scripts/run.py --system ai_scientist

# Show all options
python scripts/run.py --help

Using the RAG System

The RAG system allows you to ask questions about your documents:

python scripts/run.py

This will start the RAG system with the terminal UI, where you can directly ask questions.

Using AI Scientist

To use the AI Scientist component:

# Direct launch
python scripts/run.py --system ai_scientist

# Or start with RAG and switch
python scripts/run.py
# Then type 'scientist' at the prompt

Or run a simple test:

# Navigate to the AI Scientist directory
cd src/veritas/ai_scientist

# Simple test that generates one idea
python test_simple.py

For more information, see the AI Scientist README.

Architecture

Veritas is designed with a clear separation of concerns:

  • Core RAG Implementation (src/veritas/rag.py): The heart of the system that handles retrieval and generation
  • Application Layer (scripts/run.py): Configures and uses the core RAG system for specific use cases
  • Configuration (src/veritas/config.py): Centralized settings for the entire system
  • Apple Silicon Optimizations (src/veritas/mps_utils.py): Specialized utilities for Apple's Metal framework
  • Text Processing (src/veritas/chunking.py): Document segmentation for efficient indexing and retrieval
  • AI Scientist (src/veritas/ai_scientist): Research assistant built on top of our RAG system

UML Class Diagram

┌─────────────┐     ┌───────────────┐
│ MistralModel│     │   RAGSystem   │
│ (run.py)    │────>│  (rag.py)     │
└─────────────┘     └───────────────┘
       │                   │
       │                   │
       ▼                   ▼
┌─────────────┐     ┌───────────────┐
│ ModelConfig │     │    Config     │
└─────────────┘     └───────────────┘
                           │
                           ▼
                    ┌───────────────┐
                    │  mps_utils    │
                    └───────────────┘

Core Components

RAGSystem (src/veritas/rag.py)

The main class that implements the RAG functionality:

from veritas import RAGSystem

# Create a RAG system
rag = RAGSystem(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    llm_model="models/mistral-7b",
    index_path="models/faiss",
    device="mps"  # Use Apple Silicon acceleration
)

# Generate a complete RAG response
response = rag.generate_rag_response(
    query="How does a RAG system work?",
    top_k=5,  # Number of chunks to retrieve
    max_new_tokens=200
)

print(response["combined_response"])
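
For intuition about what `top_k` controls, the retrieval step can be sketched in pure Python. This is a toy stand-in, not the Veritas code path: it replaces the sentence-transformers encoder with bag-of-words counts and the FAISS index with a linear scan, and `embed`, `cosine`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=5):
    """Return the top_k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:top_k]
```

A larger `top_k` gives the generator more context at the cost of a longer prompt; a smaller one keeps the prompt short but risks missing relevant passages.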

MistralModel (scripts/run.py)

A wrapper around RAGSystem that handles configuration and initialization:

from src.veritas.config import Config
from scripts.run import MistralModel, ModelConfig

# Configure the model
config = ModelConfig(
    model_name=Config.LLM_MODEL,
    max_new_tokens=200,
    temperature=0.3,
    max_retrieved_chunks=3
)

# Create and load model
model = MistralModel(config)
model.load()

# Generate a response with context
context, direct_response, combined_response = model.generate(
    "What are the advantages of RAG systems over pure LLMs?"
)
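
One plausible way the retrieved context ends up in the combined response is by prepending the top chunks to the question before generation. The sketch below shows that prompt assembly under this assumption. `build_prompt` and the prompt wording are hypothetical and are not taken from `scripts/run.py`.

```python
def build_prompt(question, chunks, max_retrieved_chunks=3):
    """Assemble a context-grounded prompt from the highest-ranked chunks."""
    selected = chunks[:max_retrieved_chunks]
    # Number each chunk so the answer can refer back to its source
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(selected, start=1))
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```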

AI Scientist (src/veritas/ai_scientist)

A research assistant built on top of our Mistral model with RAG capabilities:

from src.veritas.ai_scientist.run_scientist import AIScientist

# Create an AI Scientist instance
scientist = AIScientist(
    experiment="nanoGPT_lite", 
    num_ideas=1
)

# Generate research ideas
ideas = scientist.generate_ideas()

# Print the generated ideas
for idea in ideas:
    print(f"Idea: {idea['title']}")
    print(f"Description: {idea['description']}")
    print(f"Novelty: {idea['novelty_score']}")

Advanced Usage

Custom Document Chunking

from veritas import chunk_text, get_chunk_size

# large_document holds the full text to index (a str loaded elsewhere)
# Get optimal chunk size based on document length
document_length = len(large_document)
chunk_size = get_chunk_size(document_length, target_chunks=20)

# Generate chunks with custom parameters
chunks = chunk_text(
    text=large_document,
    chunk_size=chunk_size,
    overlap=100  # Words of overlap between chunks
)

# Process each chunk
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:50]}...")
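
To show what overlapping chunks look like, here is a minimal word-based sketch. It is not the code in `src/veritas/chunking.py`: `get_chunk_size_sketch` and `chunk_text_sketch` are hypothetical, the `min_size` floor is an assumption, and both functions measure length in words so that the chunk size and the overlap use the same unit.

```python
import math


def get_chunk_size_sketch(document_length, target_chunks=20, min_size=50):
    """Pick a chunk size (in words) that yields roughly target_chunks chunks."""
    return max(min_size, math.ceil(document_length / target_chunks))


def chunk_text_sketch(text, chunk_size, overlap=100):
    """Split text into word windows of chunk_size that share overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval at the cost of some duplicated text in the index.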

Memory Optimization

from veritas.mps_utils import optimize_memory_for_m4, clear_mps_cache

# Apply comprehensive M4 optimizations at startup
optimize_memory_for_m4()

# Clear cache after heavy operations
# (model is a loaded MistralModel; complex_query is any prompt string)
result = model.generate(complex_query)
clear_mps_cache()  # Free up GPU memory
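
For readers curious what a cache-clearing helper might do, the sketch below uses PyTorch's public MPS calls. It assumes PyTorch 2.0 or later and is not the body of `clear_mps_cache`: `clear_accelerator_cache` is a hypothetical name.

```python
import gc


def clear_accelerator_cache():
    """Release cached GPU memory on Apple Silicon; returns True if MPS was cleared."""
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed, so there is no cache to clear
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()  # hand cached blocks back to the system
        return True
    return False
```

Running the garbage collector first matters because the MPS allocator can only release blocks that no live tensor still references.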

Switching Between RAG and AI Scientist

The unified interface allows you to switch between modes during a session:

# Start with RAG
python scripts/run.py

# Type 'scientist' at the prompt to switch to AI Scientist mode
# Select option 4 to return to RAG mode

Performance Optimization

Veritas includes several optimizations for Apple Silicon:

  1. MPS Acceleration: Uses Metal Performance Shaders for faster computation
  2. Memory Management: Carefully controls memory usage to prevent OOM errors
  3. Half-Precision: Uses FP16 where possible for better performance
  4. Caching Control: Explicit cache clearing to prevent memory leaks
  5. SSD Offloading: Uses SSD for temporary files to reduce RAM pressure
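
The first and third items can be illustrated by the way a device and precision are typically chosen at startup. This is a sketch under stated assumptions, not the logic in `src/veritas/mps_utils.py`: `mps_available` and `select_runtime` are hypothetical helpers, and falling back to FP32 on CPU is an assumption.

```python
def mps_available():
    """Report whether PyTorch can use the Metal (MPS) backend on this machine."""
    try:
        import torch
    except ImportError:
        return False
    return torch.backends.mps.is_available()


def select_runtime(has_mps, prefer_half=True):
    """Choose a device and precision: FP16 on MPS when requested, FP32 on CPU."""
    if has_mps:
        return {"device": "mps", "dtype": "float16" if prefer_half else "float32"}
    return {"device": "cpu", "dtype": "float32"}
```

Calling `select_runtime(mps_available())` yields settings that can be passed on when loading the model, and keeping the decision in a pure function makes the fallback easy to test on machines without a GPU.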
