Research Rover is a modern, AI-powered web application that helps researchers and students discover, analyze, and interact with academic research papers. It provides an intelligent interface for paper discovery, full-text extraction, semantic search, and AI-powered chat, making the research process faster and more insightful.
- Python 3.11+
- FastAPI: Modern, fast web framework for building REST APIs
- Uvicorn: ASGI server for FastAPI
- Pydantic: Data validation and settings management
- Sentence Transformers: AI embeddings for semantic search
- FAISS: Vector similarity search and clustering
- Crawl4ai: Advanced web scraping for full-text extraction
- BeautifulSoup4: HTML parsing and content extraction
- Pandas: Data manipulation and analysis
- Google Generative AI: LLM integration for chat functionality
- Asyncio: Asynchronous programming support
- React 18: Modern UI library with hooks
- TypeScript: Type-safe JavaScript development
- Vite: Fast build tool and development server
- Tailwind CSS: Utility-first CSS framework
- Shadcn/ui: Modern component library
- Framer Motion: Smooth animations and transitions
- React Router: Client-side routing
- Lucide React: Beautiful icon library
### Intelligent Paper Search
- PubMed integration with advanced query optimization
- Real-time search progress tracking with step indicators
- Smart query enhancement (abbreviation expansion, typo correction)
- Comprehensive metadata extraction (DOI, authors, keywords, abstracts)
- Paginated results with client-side filtering
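The query-enhancement step can be sketched as a simple lookup before the query is sent to PubMed. The abbreviation table and function name below are illustrative, not the project's actual code:

```python
# Hypothetical sketch of abbreviation expansion; the real mapping
# used by Research Rover may differ.
ABBREVIATIONS = {
    "ml": "machine learning",
    "ai": "artificial intelligence",
    "nlp": "natural language processing",
}

def enhance_query(query: str) -> str:
    """Expand known abbreviations, leaving other tokens untouched."""
    tokens = query.lower().split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

print(enhance_query("ml healthcare"))  # machine learning healthcare
```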
### Full-Text Extraction & Processing
- Automatic full-text extraction from paper URLs
- Clean JSON mapping storage (DOI → full-text content)
- Fallback to abstracts when full-text unavailable
- Enhanced content quality for better AI responses
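The DOI → full-text mapping with abstract fallback can be sketched as below; the field names and sample data are assumptions, not the project's actual schema:

```python
# Illustrative sketch: prefer extracted full text, fall back to
# the abstract so every paper is usable for AI responses.
def get_paper_text(doi: str, full_text_map: dict, abstracts: dict) -> str:
    """Return full text when available, otherwise the abstract."""
    text = full_text_map.get(doi)
    return text if text else abstracts.get(doi, "")

full_text_map = {"10.1000/xyz": "Full body of the paper..."}
abstracts = {
    "10.1000/xyz": "Short abstract.",
    "10.1000/abc": "Only abstract available.",
}

print(get_paper_text("10.1000/abc", full_text_map, abstracts))
```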
### AI-Powered Semantic Search
- Vector embeddings using Sentence Transformers (all-mpnet-base-v2)
- FAISS-based similarity search for fast retrieval
- Chunk-based processing for better context matching
- Support for both abstract and full-text embeddings
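The chunk-ranking idea can be illustrated without external dependencies. The real app encodes chunks with Sentence Transformers and searches a FAISS index; the bag-of-words cosine similarity below is only a conceptual stand-in for that vector search:

```python
import math
from collections import Counter

# Conceptual stand-in for embedding search: rank text chunks by
# cosine similarity to the query. Real embeddings come from
# all-mpnet-base-v2; this sketch uses word counts instead.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "machine learning models for diagnosis",
    "hospital staffing policy review",
    "deep learning in medical imaging",
]
print(top_chunks("machine learning in medicine", chunks, k=2))
```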
### Intelligent Chat Interface
- Query decomposition for complex research questions
- Multi-query semantic search with result deduplication
- Context-aware responses using Google Gemini
- Source citations with paper references
- Real-time streaming responses
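The multi-query deduplication step can be sketched as follows: each sub-query produces a result list, and hits are merged by DOI while preserving first-seen order. The function and data shapes are illustrative, not the project's actual code:

```python
# Hedged sketch of result deduplication across sub-query searches.
def dedupe_results(result_lists: list[list[dict]]) -> list[dict]:
    seen: set[str] = set()
    merged: list[dict] = []
    for results in result_lists:
        for paper in results:
            doi = paper["doi"]
            if doi not in seen:
                seen.add(doi)
                merged.append(paper)
    return merged

sub_a = [{"doi": "10.1/x", "title": "Paper X"}, {"doi": "10.1/y", "title": "Paper Y"}]
sub_b = [{"doi": "10.1/y", "title": "Paper Y"}, {"doi": "10.1/z", "title": "Paper Z"}]
print([p["doi"] for p in dedupe_results([sub_a, sub_b])])  # ['10.1/x', '10.1/y', '10.1/z']
```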
### Data Management
- Clean CSV export without pollution (no Full_Text column)
- Separate JSON storage for full-text content
- Efficient file management and organization
- Background processing for large datasets
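The "clean CSV" export can be sketched by writing only metadata columns and dropping any full-text field before export. The column names here are assumptions about the project's schema:

```python
import csv
import io

# Illustrative sketch: export metadata only; extrasaction="ignore"
# silently drops fields (like Full_Text) not in the column list.
METADATA_COLUMNS = ["DOI", "Title", "Authors", "Abstract"]

def export_clean_csv(papers: list[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=METADATA_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(papers)
    return buf.getvalue()

papers = [{
    "DOI": "10.1/x",
    "Title": "X",
    "Authors": "A",
    "Abstract": "...",
    "Full_Text": "huge body of extracted text",
}]
print(export_clean_csv(papers).splitlines()[0])  # DOI,Title,Authors,Abstract
```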
- Python 3.11 or higher
- Node.js 18+ (Latest LTS version recommended)
- npm or yarn package manager
- Get a Google Gemini API key (required for AI chat functionality)
- Get a PubMed API key (optional, but recommended for higher rate limits)
Configure environment variables in `backend_fastapi/.env`:

```bash
# Copy the example file
cp backend_fastapi/.env.example backend_fastapi/.env

# Edit the .env file with your API keys
GOOGLE_GENAI_API_KEY="your_gemini_api_key_here"
EMAIL="your_email@example.com"
PUBMED_API_KEY="your_pubmed_api_key_here"
```
### Clone the repository

```bash
git clone https://github.com/yourusername/Research-Rover.git
cd Research-Rover
```
### Backend Setup (FastAPI)

```bash
cd backend
uv venv

# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

uv pip install -r requirements.txt
python -m uvicorn main:app --reload --host 0.0.0.0 --port 8000
```
### Frontend Setup

```bash
cd frontend
npm install
npm run dev
```
### Access the Application
- Frontend: http://localhost:5173
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Enter research keywords (e.g., "machine learning healthcare")
- The system automatically optimizes queries (expands "ml" to "machine learning")
- Monitor real-time search progress with step-by-step indicators
- Browse paginated results with rich metadata
- Export clean CSV files for offline analysis
- After searching, create vector embeddings for semantic search
- Full-text extraction automatically attempts to get complete paper content
- Fallback to abstracts ensures all papers are processed
- Progress tracking shows embedding generation status
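The background-processing-with-progress pattern can be sketched as below. The progress dict and worker function are illustrative; in the real app, progress is tracked server-side and exposed through an API endpoint:

```python
import threading
import time

# Hedged sketch: a background worker updates a shared progress
# record while embedding chunks, so a status endpoint can report it.
progress = {"done": 0, "total": 0, "status": "idle"}

def build_embeddings(chunks: list[str]) -> None:
    progress.update(done=0, total=len(chunks), status="running")
    for _ in chunks:
        time.sleep(0.01)  # stand-in for embedding one chunk
        progress["done"] += 1
    progress["status"] = "complete"

worker = threading.Thread(target=build_embeddings, args=(["c1", "c2", "c3"],))
worker.start()
worker.join()
print(progress)  # {'done': 3, 'total': 3, 'status': 'complete'}
```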
- Ask complex research questions about your collected papers
- System decomposes queries into sub-questions for comprehensive answers
- Get contextual responses with proper source citations
- Real-time streaming responses for better user experience
- Papers stored in clean CSV format (metadata only)
- Full-text content stored separately in JSON mapping files
- Efficient file organization in the `data/` directory
- Background processing for large datasets
- `POST /api/v1/search/`: Search for research papers
- `GET /api/v1/search/progress`: Get current search progress
- `GET /api/v1/files/list`: List available CSV files
- `GET /api/v1/files/download/(unknown)`: Download CSV files
- `POST /api/v1/embeddings/(unknown)`: Create vector embeddings
- `GET /api/v1/embeddings/(unknown)/status`: Check embedding status
- `GET /api/v1/embeddings/progress`: Get embedding progress
- `POST /api/v1/chat/`: AI chat with research papers
- `GET /`: Health check endpoint
- `GET /health`: Detailed system health status
- `GET /docs`: Interactive API documentation
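A search request could be issued from Python roughly as follows. The payload fields are assumptions about the request schema; check `http://localhost:8000/docs` for the actual model. The request is only constructed here, not sent:

```python
import json
import urllib.request

# Hedged client sketch: build a POST request to the search endpoint.
# Payload field names ("query", "max_results") are illustrative.
payload = {"query": "machine learning healthcare", "max_results": 20}
req = urllib.request.Request(
    "http://localhost:8000/api/v1/search/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)
```

With the backend running, `urllib.request.urlopen(req)` would send it.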
- Search: PubMed API → Paper Metadata → CSV Storage
- Full-Text: URL Extraction → Web Scraping → JSON Mapping
- Embeddings: Text Processing → Vector Generation → FAISS Index
- Chat: Query → Semantic Search → LLM Processing → Response
```
data/
├── {query}.csv                                 # Paper metadata
├── {query}_full_text_mapping.json              # Full-text content
├── {query}_paper_chunks_hdbscan.index          # FAISS vector index
├── {query}_paper_chunk_metadata_hdbscan.json   # Chunk metadata
└── {query}_paper_data_doi_mapped_hdbscan.json  # DOI mappings
```
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- PubMed/NCBI for providing access to biomedical literature
- Google for Gemini AI integration
- Sentence Transformers for semantic embeddings
- FastAPI community for excellent documentation