A comprehensive Python tool for searching research articles across multiple academic databases with advanced filtering, deduplication, and export capabilities.
- Multi-Source Search: Query multiple academic databases simultaneously:
  - Semantic Scholar
  - arXiv
  - PubMed
  - CrossRef
  - OpenAlex
  - Google Scholar (unofficial, use with caution)
- Flexible Search Options:
  - Search by keywords, title, or author
  - Apply filters for publication year range and citation count
  - Configurable result limits per source
- Smart Caching:
  - Reuse cached results to avoid repeated API calls
  - Configurable cache expiry times
  - Options to clear all or only expired cache entries
- Deduplication:
  - Automatic removal of duplicate articles based on DOI and title
  - Preserves the most complete record for each unique article
- Export Capabilities:
  - CSV, JSON, BibTeX, RIS, and Excel formats
  - APA 7 formatted references for all articles
  - Customizable output filenames
- Rate Limiting:
  - Respectful API usage with configurable rate limits
  - Higher rate limits when API keys or contact emails are provided
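The deduplication behavior listed above (match on DOI and title, keep the most complete record) could be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names and the `doi` and `title` dictionary keys are assumptions, and "most complete" is taken here to mean "has the most non-empty fields".

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace runs."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def completeness(article):
    """Count the fields that carry a non-empty value."""
    return sum(1 for value in article.values() if value not in (None, "", [], {}))


def deduplicate(articles):
    """Keep one record per article, matching on DOI first and title second."""
    unique = []  # records kept so far
    index = {}   # ("doi" | "title", value) -> position in `unique`
    for article in articles:
        keys = []
        doi = (article.get("doi") or "").strip().lower()
        if doi:
            keys.append(("doi", doi))
        title = normalize_title(article.get("title") or "")
        if title:
            keys.append(("title", title))

        # Find an earlier record that shares a DOI or a normalized title.
        position = next((index[key] for key in keys if key in index), None)
        if position is None:
            position = len(unique)
            unique.append(article)
        elif completeness(article) > completeness(unique[position]):
            unique[position] = article  # prefer the fuller record

        for key in keys:
            index.setdefault(key, position)
    return unique
```

One consequence of this design is that two records with different DOIs but the same normalized title (for example, a preprint and its published version) are treated as duplicates; whether that is desirable depends on the use case.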
- Clone the repository:

      git clone https://github.com/yourusername/research-finder.git
      cd research-finder

- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install the required dependencies:

      pip install -r requirements.txt

- Create a .env file in the project root with your API keys (optional but recommended):

      # Semantic Scholar API Key (for higher rate limits)
      S2_API_KEY=your_semantic_scholar_api_key
      # PubMed API Key (for higher rate limits)
      PUBMED_API_KEY=your_pubmed_api_key
      # OpenAlex Email (for 'polite pool' access)
      OPENALEX_EMAIL=your_email@example.com
      # CrossRef Email (for 'polite pool' access)
      CROSSREF_MAILTO=your_email@example.com
      # Optional: Log file path
      LOG_FILE=research_finder.log
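As a rough illustration of how the optional variables above might be consumed, the sketch below reads them from the process environment and reports which sources have a credential. It assumes the values have already been loaded into the environment (for example by the shell or a .env loader); `ENV_VARS` and `read_credentials` are hypothetical names, not part of the tool.

```python
import os

# Variable names are taken from the .env example; the mapping itself is illustrative.
ENV_VARS = {
    "Semantic Scholar": "S2_API_KEY",
    "PubMed": "PUBMED_API_KEY",
    "OpenAlex": "OPENALEX_EMAIL",
    "CrossRef": "CROSSREF_MAILTO",
}


def read_credentials(environ=None):
    """Return {source: value or None} for each optional credential."""
    environ = os.environ if environ is None else environ
    credentials = {}
    for source, name in ENV_VARS.items():
        value = (environ.get(name) or "").strip()
        credentials[source] = value or None  # blank values count as missing
    return credentials


if __name__ == "__main__":
    for source, value in read_credentials().items():
        status = "configured" if value else "not set (default rate limit applies)"
        print(f"{source}: {status}")
```

Passing a plain dictionary instead of `os.environ` makes the function easy to test without touching the real environment.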
Run the tool from the command line:

    python main.py

Follow the interactive prompts to:
- Select search type (keywords, title, or author)
- Enter your search query
- Set result limits
- Configure cache management
- Apply optional filters
- Select which databases to search
- Choose export format and filename
--- Research Article Finder ---
What would you like to search by?
1. Keywords (in title, abstract, etc.)
2. Title
3. Author
Select search type (1-3, default=1): 1
Enter keywords to search for: machine learning in healthcare
Enter max results per source (e.g., 10): 20
--- Cache Management ---
1. Don't clear cache (use existing cached results)
2. Clear only expired cache entries
3. Clear all cache entries
Select cache option (1-3, default=1): 2
--- Filter Search Results (Optional) ---
Would you like to apply any filters to the search results? (y/n, default=n): y
--- Set Filter Criteria ---
Filter by publication year? (y/n, default=n): y
Enter start year (e.g., 2020, leave blank for no limit): 2020
Enter end year (e.g., 2023, leave blank for no limit): 2023
Filter by minimum citation count? (y/n, default=n): y
Enter minimum citation count (e.g., 50): 50
--- Select Search Vendors ---
1. Semantic Scholar
2. arXiv
3. PubMed
4. CrossRef
5. OpenAlex
6. Google Scholar (Unreliable)
Enter vendor numbers to use (e.g., 1,2) or press Enter for all: 1,3,5
--- Search Summary ---
Successfully searched: Semantic Scholar, PubMed, OpenAlex
Found 45 unique articles.
Would you like to export these results? (y/n): y
--- Select Export Format ---
1. CSV
2. JSON
3. BibTeX
4. RIS
5. Excel
Select export format (1-5, default=1): 1
Enter output filename (without extension): ml_healthcare_2020_2023
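The year-range and minimum-citation filters chosen in the session above could be applied to the merged results along these lines. This is a sketch under stated assumptions: the `year` and `citations` keys and the `apply_filters` name are hypothetical, and records with an unknown year are dropped whenever a year filter is active, which may differ from what the tool does.

```python
def apply_filters(articles, start_year=None, end_year=None, min_citations=None):
    """Return the articles that satisfy every filter that was set."""
    kept = []
    for article in articles:
        year = article.get("year")
        citations = article.get("citations")

        # Assumption: an unknown year cannot satisfy a year filter.
        if (start_year is not None or end_year is not None) and year is None:
            continue
        if start_year is not None and year < start_year:
            continue
        if end_year is not None and year > end_year:
            continue
        # Assumption: a missing citation count is treated as zero.
        if min_citations is not None and (citations or 0) < min_citations:
            continue
        kept.append(article)
    return kept


sample = [
    {"title": "A", "year": 2019, "citations": 120},
    {"title": "B", "year": 2021, "citations": 75},
    {"title": "C", "year": 2022, "citations": 10},
    {"title": "D", "year": None, "citations": 90},
]
# Same criteria as the example session: 2020 to 2023, at least 50 citations.
filtered = apply_filters(sample, start_year=2020, end_year=2023, min_citations=50)
print([article["title"] for article in filtered])
```

Leaving a parameter as `None` mirrors leaving the corresponding prompt blank: that criterion is simply not applied.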
research_finder/
├── cache/                  # Cache storage directory
├── output/                 # Exported files directory
├── research_finder/        # Main package
│   ├── __init__.py
│   ├── aggregator.py       # Coordinates searches across sources
│   ├── cache.py            # Caching functionality
│   ├── exporter.py         # Export functionality
│   ├── utils.py            # Utility functions
│   ├── validator.py        # Configuration validation
│   └── searchers/          # Database-specific searchers
│       ├── __init__.py
│       ├── arxiv.py
│       ├── base_searcher.py
│       ├── crossref.py
│       ├── google_scholar.py
│       ├── openalex.py
│       ├── pubmed.py
│       └── semantic_scholar.py
├── tests/                  # Test suite
├── config.py               # Configuration settings
├── LICENSE                 # MIT License
├── main.py                 # Main entry point
├── README.md               # This file
└── requirements.txt        # Dependencies
While the tool works without API keys, providing them enables higher rate limits:
| Source | API Key/Email | Rate Limit with Key | Rate Limit without Key |
|---|---|---|---|
| Semantic Scholar | S2_API_KEY | 1 req/sec | 1 req/10 sec |
| PubMed | PUBMED_API_KEY | 10 req/sec | 3 req/sec |
| OpenAlex | OPENALEX_EMAIL | 10 req/sec | 2 req/sec |
| CrossRef | CROSSREF_MAILTO | 1 req/sec | 1 req/2 sec |
| arXiv | Not required | 2 req/sec | 2 req/sec |
| Google Scholar | Not available | 1 req/5 sec | 1 req/5 sec |
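The limits in the table above translate into a minimum delay between consecutive requests to each source. The sketch below shows one simple way to enforce such a delay; `MIN_INTERVAL`, `interval_for`, and `RateLimiter` are hypothetical names used for illustration and are not taken from the tool's code.

```python
import time

# Minimum seconds between requests, derived from the table
# (for example, 1 req/10 sec -> 10.0 and 10 req/sec -> 0.1).
MIN_INTERVAL = {
    "Semantic Scholar": {"with_key": 1.0, "without_key": 10.0},
    "PubMed": {"with_key": 0.1, "without_key": 1 / 3},
    "OpenAlex": {"with_key": 0.1, "without_key": 0.5},
    "CrossRef": {"with_key": 1.0, "without_key": 2.0},
    "arXiv": {"with_key": 0.5, "without_key": 0.5},
    "Google Scholar": {"with_key": 5.0, "without_key": 5.0},
}


def interval_for(source, has_key):
    """Look up the minimum delay for a source, with or without a key/email."""
    return MIN_INTERVAL[source]["with_key" if has_key else "without_key"]


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._sleep = sleep
        self._last = None    # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would create one limiter per source, for example `RateLimiter(interval_for("PubMed", has_key=True))`, and call `wait()` immediately before each request.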
Run the test suite:

    pytest

- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
When adding a new searcher:
- Inherit from BaseSearcher in research_finder/searchers/base_searcher.py
- Implement the required search method
- Add tests in the tests/test_searchers directory
- Update documentation
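The steps above might look roughly like the following. Because the real BaseSearcher interface is not shown here, the example defines a local stand-in with an assumed `search(query, limit)` signature; check research_finder/searchers/base_searcher.py for the actual contract before copying this pattern. `InMemorySearcher` is a hypothetical example class.

```python
from abc import ABC, abstractmethod


class BaseSearcher(ABC):
    """Local stand-in for the project's BaseSearcher (assumed interface)."""

    name = "base"

    @abstractmethod
    def search(self, query, limit=10):
        """Return a list of article dictionaries matching the query."""


class InMemorySearcher(BaseSearcher):
    """Toy searcher that matches query terms against a fixed list of records."""

    name = "In-Memory Example"

    def __init__(self, records):
        self._records = list(records)

    def search(self, query, limit=10):
        terms = query.lower().split()
        matches = [
            record
            for record in self._records
            if all(term in record.get("title", "").lower() for term in terms)
        ]
        return matches[:limit]


def test_in_memory_searcher_respects_limit():
    """Example of a test that could live under tests/test_searchers."""
    searcher = InMemorySearcher(
        [
            {"title": "Machine Learning in Healthcare"},
            {"title": "Machine Learning for Imaging"},
            {"title": "Graph Theory"},
        ]
    )
    results = searcher.search("machine learning", limit=1)
    assert [r["title"] for r in results] == ["Machine Learning in Healthcare"]
```

A real searcher would replace the in-memory lookup with a rate-limited API call and map the response fields onto whatever record format the aggregator expects.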
This project is licensed under the MIT License - see the LICENSE file for details.
- Google Scholar has no official API, and the scraper is unreliable and may be blocked. Use with caution.
- Always respect the terms of service of the APIs you are using.
- Citation counts may vary between sources as they are calculated differently.
- Semantic Scholar Academic Graph API
- arXiv API
- NCBI Entrez APIs
- CrossRef REST API
- OpenAlex
- scholarly for Google Scholar access