Research Rover

Introduction

Research Rover is an AI-powered web application that helps researchers and students discover, analyze, and query academic research papers. It combines paper discovery, full-text extraction, semantic search, and an AI chat interface in a single tool.

Tech Stack

Backend

- Python 3.11+
- FastAPI: Modern, fast web framework for building REST APIs
- Uvicorn: ASGI server for FastAPI
- Pydantic: Data validation and settings management
- Sentence Transformers: AI embeddings for semantic search
- FAISS: Vector similarity search and clustering
- Crawl4AI: Advanced web scraping for full-text extraction
- BeautifulSoup4: HTML parsing and content extraction
- Pandas: Data manipulation and analysis
- Google Generative AI: LLM integration for chat functionality
- asyncio: Asynchronous programming support

Frontend

- React 18: Modern UI library with hooks
- TypeScript: Type-safe JavaScript development
- Vite: Fast build tool and development server
- Tailwind CSS: Utility-first CSS framework
- shadcn/ui: Modern component library
- Framer Motion: Smooth animations and transitions
- React Router: Client-side routing
- Lucide React: Icon library

Features

Currently Implemented

Intelligent Paper Search
- PubMed integration with query optimization
- Real-time search progress tracking with step indicators
- Smart query enhancement (abbreviation expansion, typo correction)
- Comprehensive metadata extraction (DOI, authors, keywords, abstracts)
- Paginated results with client-side filtering

Full-Text Extraction & Processing
- Automatic full-text extraction from paper URLs
- Clean JSON mapping storage (DOI → full-text content)
- Fallback to abstracts when full text is unavailable
- Enhanced content quality for better AI responses

AI-Powered Semantic Search
- Vector embeddings using Sentence Transformers (all-mpnet-base-v2)
- FAISS-based similarity search for fast retrieval
- Chunk-based processing for better context matching
- Support for both abstract and full-text embeddings

Intelligent Chat Interface
- Query decomposition for complex research questions
- Multi-query semantic search with result deduplication
- Context-aware responses using Google Gemini
- Source citations with paper references
- Real-time streaming responses

Data Management
- CSV export limited to metadata (no Full_Text column)
- Separate JSON storage for full-text content
- Efficient file management and organization
- Background processing for large datasets

Prerequisites

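- Python 3.11 or higher
- uv (used by the backend setup commands to create the virtual environment and install dependencies)
- Node.js 18+ (latest LTS version recommended)
- npm or yarn package manager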

API Configuration

- Get a Google Gemini API key: required for AI chat functionality
- Get a PubMed API key: optional, but recommended for higher rate limits

Configure environment variables in backend/.env. First copy the example file:

```
cp backend/.env.example backend/.env
```

Then edit backend/.env and set your API keys:

```
GOOGLE_GENAI_API_KEY="your_gemini_api_key_here"
EMAIL="your_email@example.com"
PUBMED_API_KEY="your_pubmed_api_key_here"
```
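
As a rough illustration of how these three variables might be consumed, the sketch below reads them from an environment mapping and fails early when the required Gemini key is missing. It assumes the values are already present in the process environment; how the backend itself loads the .env file is not shown in this README, and the `Settings` class and `load_settings` function are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three variables listed above."""
    google_genai_api_key: str
    email: str
    pubmed_api_key: Optional[str]  # optional: only needed for higher rate limits


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on a missing Gemini key."""
    gemini_key = env.get("GOOGLE_GENAI_API_KEY", "").strip()
    if not gemini_key:
        raise RuntimeError("GOOGLE_GENAI_API_KEY is required for AI chat")
    return Settings(
        google_genai_api_key=gemini_key,
        email=env.get("EMAIL", "").strip(),
        # An empty or absent PubMed key is treated as "not configured".
        pubmed_api_key=env.get("PUBMED_API_KEY", "").strip() or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.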

Installation

Clone the repository

```
git clone https://github.com/yourusername/Research-Rover.git
cd Research-Rover
```

Backend Setup (FastAPI)

```
cd backend
uv venv

# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

uv pip install -r requirements.txt
python -m uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

Frontend Setup
```
cd frontend
npm install
npm run dev
```

Access the Application
- Frontend: http://localhost:5173
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs

Usage Guide

1. Searching Papers

- Enter research keywords (e.g., "machine learning healthcare").
- The system automatically optimizes queries (for example, expanding "ml" to "machine learning").
- Monitor search progress in real time through step-by-step indicators.
- Browse paginated results with rich metadata.
- Export clean CSV files for offline analysis.
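
The abbreviation expansion mentioned above could look roughly like the following. This is a minimal sketch, not the application's actual optimizer: the `ABBREVIATIONS` table and the `expand_abbreviations` function are hypothetical, and typo correction is left out.

```python
import re

# Hypothetical abbreviation table; the application's real table is not shown here.
ABBREVIATIONS = {
    "ml": "machine learning",
    "nlp": "natural language processing",
}


def expand_abbreviations(query: str) -> str:
    """Replace whole-word abbreviations and leave every other word untouched."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        # Lookup is case-insensitive; unknown words are returned unchanged.
        return ABBREVIATIONS.get(word.lower(), word)

    # Matching whole alphabetic runs avoids rewriting "ml" inside "html".
    return re.sub(r"[A-Za-z]+", replace, query)
```

For example, `expand_abbreviations("ml healthcare")` returns `"machine learning healthcare"`, while `"html parsing"` passes through unchanged.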

2. Creating Embeddings

- After searching, create vector embeddings for semantic search.
- Full-text extraction automatically attempts to retrieve the complete paper content.
- When full text is unavailable, the abstract is used instead, so every paper is still processed.
- Progress tracking shows the embedding generation status.
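
To make the fallback and chunking steps concrete, here is a small sketch under stated assumptions: full text is looked up by DOI, the abstract is used when nothing was extracted, and the chosen text is split into overlapping word windows. The function names and the window sizes are illustrative only; the project's actual chunking parameters are not documented here.

```python
from typing import Dict, List


def text_for_paper(doi: str, abstract: str, full_text_by_doi: Dict[str, str]) -> str:
    """Prefer extracted full text; fall back to the abstract when none exists."""
    full_text = full_text_by_doi.get(doi, "").strip()
    return full_text if full_text else abstract


def chunk_words(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded separately, which is what lets a long paper match a narrow question on a single passage.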

3. AI-Powered Chat

- Ask complex research questions about your collected papers.
- The system decomposes each query into sub-questions for more comprehensive answers.
- Responses are contextual and cite their source papers.
- Responses stream in real time.
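
When a question is decomposed into sub-questions, each sub-question produces its own list of retrieved chunks, and the same chunk can appear more than once. One plausible way to deduplicate, sketched below, is to keep each chunk once with its best score. The `Hit` shape (chunk ID plus a similarity score where higher is better) and the `merge_hits` function are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (chunk_id, similarity score; higher is better)


def merge_hits(result_lists: Iterable[List[Hit]], top_k: int = 5) -> List[Hit]:
    """Merge hits from several sub-queries, keeping each chunk once with its best score."""
    best: Dict[str, float] = {}
    for hits in result_lists:
        for chunk_id, score in hits:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    # Rank the surviving chunks by score, best first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

The merged list is what would be handed to the language model as context, along with the paper references needed for citations.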

4. Data Management

- Papers are stored in CSV format (metadata only).
- Full-text content is stored separately in JSON mapping files.
- Files are organized in the data/ directory.
- Large datasets are processed in the background.
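
The split between metadata and full text can be sketched as follows. Only the `Full_Text` column name comes from this README; the other column names and the `split_records` function are hypothetical, and the sketch returns strings instead of writing files so that it stays self-contained.

```python
import csv
import io
import json
from typing import Dict, List, Tuple

# Hypothetical metadata columns; only "Full_Text" is named in this README.
METADATA_COLUMNS = ["DOI", "Title", "Abstract"]


def split_records(records: List[Dict[str, str]]) -> Tuple[str, str]:
    """Return (csv_text, json_text): metadata rows and a DOI -> full-text mapping."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any key not listed in METADATA_COLUMNS,
    # which keeps Full_Text out of the CSV.
    writer = csv.DictWriter(buffer, fieldnames=METADATA_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)

    # Papers without extracted full text are simply absent from the mapping.
    mapping = {r["DOI"]: r["Full_Text"] for r in records if r.get("Full_Text")}
    return buffer.getvalue(), json.dumps(mapping, indent=2)
```

Keeping long full-text bodies out of the CSV keeps the export small and easy to open in a spreadsheet.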

API Endpoints

Search & Discovery

- POST /api/v1/search/: Search for research papers
- GET /api/v1/search/progress: Get current search progress
- GET /api/v1/files/list: List available CSV files
- GET /api/v1/files/download/{filename}: Download CSV files

AI & Embeddings

- POST /api/v1/embeddings/{filename}: Create vector embeddings
- GET /api/v1/embeddings/{filename}/status: Check embedding status
- GET /api/v1/embeddings/progress: Get embedding progress
- POST /api/v1/chat/: AI chat with research papers

System

- GET /: Health check endpoint
- GET /health: Detailed system health status
- GET /docs: Interactive API documentation
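
Several of the routes above take a `{filename}` path parameter, and CSV names derived from search queries may contain spaces. The helper below is a sketch of how a client might build those URLs with the filename percent-encoded. The base URL is the local development address from the setup section, the function names are hypothetical, and request bodies are not covered because they are not documented here.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # local backend address from the setup section
API_PREFIX = "/api/v1"


def _encode(filename: str) -> str:
    """Percent-encode a filename so it is safe as a single path segment."""
    return quote(filename, safe="")


def download_url(filename: str) -> str:
    return f"{BASE_URL}{API_PREFIX}/files/download/{_encode(filename)}"


def embeddings_url(filename: str) -> str:
    return f"{BASE_URL}{API_PREFIX}/embeddings/{_encode(filename)}"


def embeddings_status_url(filename: str) -> str:
    return f"{embeddings_url(filename)}/status"
```

The interactive documentation at /docs remains the authoritative source for each route's parameters and response shape.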

Architecture

Data Flow

- Search: PubMed API → Paper Metadata → CSV Storage
- Full-Text: URL Extraction → Web Scraping → JSON Mapping
- Embeddings: Text Processing → Vector Generation → FAISS Index
- Chat: Query → Semantic Search → LLM Processing → Response
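
The semantic search step in the chat flow comes down to finding the stored vectors closest to the query vector. The sketch below shows that idea with a brute-force cosine-similarity scan in pure Python. It is a stand-in, not the project's code: the application uses FAISS for this step and Sentence Transformers to produce the vectors, and the tiny two-dimensional vectors in the usage note are made up.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With `vectors = [[1, 0], [0, 1], [1, 1]]` and the query `[1, 0]`, the two best matches are indices 0 and 2. A vector index such as FAISS serves the same purpose but is built to stay fast on large collections.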

File Structure

```
data/
├── {query}.csv                                  # Paper metadata
├── {query}_full_text_mapping.json               # Full-text content
├── {query}_paper_chunks_hdbscan.index           # FAISS vector index
├── {query}_paper_chunk_metadata_hdbscan.json    # Chunk metadata
└── {query}_paper_data_doi_mapped_hdbscan.json   # DOI mappings
```
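
Because every artifact name is derived from the query, the full set of paths for one search can be computed from a single string. The helper below is a sketch of that naming scheme; the function name and dictionary keys are hypothetical, and the query is used verbatim because any sanitization the application applies to it is not described here.

```python
from pathlib import Path
from typing import Dict


def artifact_paths(query: str, data_dir: str = "data") -> Dict[str, Path]:
    """Map each artifact kind to its expected path for a given query name."""
    base = Path(data_dir)
    return {
        "metadata_csv": base / f"{query}.csv",
        "full_text_json": base / f"{query}_full_text_mapping.json",
        "faiss_index": base / f"{query}_paper_chunks_hdbscan.index",
        "chunk_metadata": base / f"{query}_paper_chunk_metadata_hdbscan.json",
        "doi_mapping": base / f"{query}_paper_data_doi_mapped_hdbscan.json",
    }
```

A caller could use the result to check which stages have already run, for example by testing whether the `faiss_index` path exists before offering chat.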

Contributing

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

- PubMed/NCBI for providing access to biomedical literature
- Google for Gemini AI integration
- Sentence Transformers for semantic embeddings
- FastAPI community for excellent documentation

Subrojyoti/Research-Rover