
@techycardiac

This commit introduces several improvements that focus my research on academic and reputable medical literature, prioritize high-impact sources, and speed up retrieval for such queries.

Key changes:

  1. Retriever Output Standardization:

    • All sources I consult now include a retriever_name field in their output dictionaries.
    • SemanticScholarSearch now includes citation_count, venue, and year.
    • PubMedCentralSearch now includes journal_title (when available).
  2. Improved Metadata Pipeline:

    • I've modified my approach to ensure that rich metadata from sources (including retriever_name, citation_count, etc.) is preserved and combined with scraped web content (raw_content).
    • This consolidated list of structured dictionaries is now correctly passed for further curation.
  3. Enhanced Curation Prompt:

    • The way I curate sources has been significantly updated.
    • I now explicitly:
      • Prioritize sources from "semantic_scholar", "pubmed_central", and "arxiv".
      • Utilize citation_count from "semantic_scholar" for ranking.
      • Filter out non-academic/medical content.
      • Consider journal quality and relevance to your query.
  4. Academic Search Focus Configuration:

    • I now leverage the existing (and now verified) FOCUS_ACADEMIC_MEDICAL_SOURCES configuration flag.
    • When True, this flag ensures that I only use academic-focused sources (semantic_scholar, pubmed_central, arxiv), improving relevance and speed for such queries.
  5. Speed Enhancements:

    • The primary speed improvements stem from my focused source selection and more effective curation, reducing the amount of data I need to process.
    • My existing scraping mechanism was confirmed to be asynchronous and efficient.
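
To illustrate items 1 and 2, here is a minimal sketch of how retriever metadata might be kept alongside scraped content. It is not the code from this commit: the `merge_metadata` helper and the `url` join key are assumptions made for the example, while `retriever_name`, `citation_count`, `venue`, `year`, `journal_title`, and `raw_content` are the fields described above.

```python
from typing import Any, Dict, List, Optional


def merge_metadata(
    search_results: List[Dict[str, Any]],
    scraped_pages: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Attach scraped raw_content to each retriever result (hypothetical helper).

    Assumes both inputs carry a "url" key to join on; the real pipeline
    may use a different key.
    """
    # Index scraped text by URL so each lookup is O(1).
    content_by_url: Dict[str, Optional[str]] = {
        page["url"]: page.get("raw_content") for page in scraped_pages
    }

    merged: List[Dict[str, Any]] = []
    for result in search_results:
        entry = dict(result)  # shallow copy: keep every metadata field as-is
        # raw_content is None when the page was not scraped successfully.
        entry["raw_content"] = content_by_url.get(result["url"])
        merged.append(entry)
    return merged


if __name__ == "__main__":
    results = [
        {"url": "https://example.org/a", "retriever_name": "semantic_scholar",
         "citation_count": 120, "venue": "Example Venue", "year": 2021},
        {"url": "https://example.org/b", "retriever_name": "pubmed_central",
         "journal_title": "Example Journal"},
    ]
    pages = [{"url": "https://example.org/a", "raw_content": "full text of a"}]
    for item in merge_metadata(results, pages):
        print(item["retriever_name"], item["raw_content"])
```

Copying each result before adding `raw_content` leaves the retriever output untouched, so the same list can still be inspected or logged separately.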

These changes collectively enable me to perform more targeted and higher-quality research for academic and medical topics, aligning with your requirements.
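
For items 3 and 4, the sketch below shows the intended behavior in deterministic form. It is an illustration under stated assumptions rather than the implementation: the actual curation happens through a prompt, so `rank_sources` is only a stand-in for the ordering that prompt asks for, and `select_retrievers`, `ACADEMIC_RETRIEVERS`, and the `web_search` retriever name are hypothetical. The `focus_academic` argument plays the role of `FOCUS_ACADEMIC_MEDICAL_SOURCES`.

```python
from typing import Any, Dict, List

# The three academic-focused sources named in this PR.
ACADEMIC_RETRIEVERS = ("semantic_scholar", "pubmed_central", "arxiv")


def select_retrievers(configured: List[str], focus_academic: bool) -> List[str]:
    """Keep only academic retrievers when the focus flag is on (hypothetical)."""
    if not focus_academic:
        return list(configured)
    return [name for name in configured if name in ACADEMIC_RETRIEVERS]


def rank_sources(sources: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order sources: academic retrievers first, then by citation_count.

    A deterministic stand-in for the prompt-based curation described above.
    """
    def sort_key(source: Dict[str, Any]):
        is_academic = source.get("retriever_name") in ACADEMIC_RETRIEVERS
        citations = source.get("citation_count") or 0  # missing counts as 0
        # False sorts before True, so academic sources come first;
        # negated citations put the most-cited source at the top.
        return (not is_academic, -citations)

    return sorted(sources, key=sort_key)


if __name__ == "__main__":
    print(select_retrievers(["web_search", "arxiv", "semantic_scholar"], True))
    ranked = rank_sources([
        {"retriever_name": "web_search"},
        {"retriever_name": "semantic_scholar", "citation_count": 5},
        {"retriever_name": "semantic_scholar", "citation_count": 50},
    ])
    print([s.get("citation_count") for s in ranked])
```

Because `sorted` is stable, sources with equal keys keep their original retrieval order.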

@assafelovic
Owner

@techycardiac this is truly great! Have you fully tested the other retrievers and the general experience with this addition? Happy to know how and where to help test this before we merge.
