- Fetches academic RSS feeds, filters entries with per-topic regexes, and writes the results into SQLite databases. HTML pages with the results are rendered directly from the database.
- Results are ranked by cosine similarity to a set of user-defined keywords. A configurable list of authors can receive a ranking boost, so papers from your friends/competitors can be ranked higher.
- The highest-ranked results are summarized by an LLM. For summarization to work, you need an OpenAI API key.
- New search topics can be created by simply adding a YAML config file under `config/topics/your_topic_name.yaml`. Look at the other topics for guidance. The script is configured to run via GitHub Actions each day at 6 AM to fetch new papers. For your own runs, configure the `.github/workflows/pages.yml` file.
- Written using Python 3.11. For dependencies, check `requirements.txt`.
- The OpenAI API key is looked up in the `openaikulcs.env` file in the repo root or in the `OPENAI_API_KEY` environment variable.
- Filter: fetch from the RSS feed list, dedup, match regexes, write DBs, render HTML

  `python cli/main.py -v filter [--topic TOPIC]`

  - Backs up `all_feed_entries.db` and `matched_entries_history.db` (keeps the 3 latest backups).
  - Clears the `papers.db` working table before processing this run.
  - Fetches the configured feeds for each topic, dedups by title against `all_feed_entries.db`, and filters by topic regex.
  - Writes matches to `papers.db` (`status='filtered'`); optionally archives to `matched_entries_history.db` if `output.archive: true`.
  - Saves ALL processed entries (matched and non-matched) to `all_feed_entries.db` for future dedup.
  - Renders per-topic HTML from `papers.db`.
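The matching logic above can be sketched in a few lines. This is an illustration only: the field names and the dedup-by-title rule come from this section, but the real implementation lives in the CLI and persists its dedup state to `all_feed_entries.db` rather than an in-memory set.

```python
import re

def filter_entries(entries, pattern, fields=("title", "summary")):
    """Keep entries whose selected fields match the topic regex,
    skipping duplicate titles (sketch of the filter step)."""
    rx = re.compile(pattern, re.IGNORECASE)
    seen_titles = set()
    matched = []
    for entry in entries:
        if entry["title"] in seen_titles:
            continue  # already seen: would be deduped against all_feed_entries.db
        seen_titles.add(entry["title"])
        if any(rx.search(entry.get(f, "")) for f in fields):
            matched.append(entry)
    return matched

entries = [
    {"title": "Moire superlattices in graphene", "summary": "twisted bilayer"},
    {"title": "Moire superlattices in graphene", "summary": "duplicate item"},
    {"title": "Protein folding dynamics", "summary": "biophysics"},
]
hits = filter_entries(entries, r"graphene|bilayer")
```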
- Rank: ranks fetched papers by title against the keywords in the topic's YAML config (optional)

  `python cli/main.py rank [--topic TOPIC]`

  - Computes and writes `rank_score` for `papers.db` entries using sentence-transformers. HTML files with ranked entries are generated.
  - Model selection: if `models/all-MiniLM-L6-v2` exists, it is used; otherwise it falls back to the Hugging Face repo id `all-MiniLM-L6-v2` and downloads once into the cache. You can vendor the model with `python scripts/vendor_model.py`.
  - Scoring details: applies a small penalty for `ranking.negative_queries` matches (title/summary). Optional boosts: per-topic `ranking.preferred_authors` with `ranking.priority_author_boost`, and global `priority_journal_boost` for feeds listed in `priority_journals`.
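To make the boost/penalty arithmetic concrete, here is an illustrative sketch. The base score would come from the sentence-transformers cosine similarity; the penalty magnitude (`negative_penalty`) and the default boost values are assumptions, and only the config key names come from this section.

```python
def adjust_score(base_score, entry, *, negative_queries=(), preferred_authors=(),
                 priority_author_boost=0.1, priority_journals=(),
                 priority_journal_boost=0.05, negative_penalty=0.1):
    """Apply the penalties/boosts described above to a cosine-similarity
    score. Defaults and penalty size are illustrative assumptions."""
    score = base_score
    text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
    if any(q.lower() in text for q in negative_queries):
        score -= negative_penalty          # small penalty for negative-query hits
    if any(a.lower() in entry.get("authors", "").lower() for a in preferred_authors):
        score += priority_author_boost     # per-topic preferred-author boost
    if entry.get("feed_name") in priority_journals:
        score += priority_journal_boost    # global priority-journal boost
    return max(0.0, min(1.0, score))       # keep the score in [0, 1]

entry = {"title": "Graphene moire bands", "summary": "",
         "authors": "A. Friend", "feed_name": "Nature Physics"}
score = adjust_score(0.5, entry, preferred_authors=["A. Friend"],
                     priority_journals=["Nature Physics"])
```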
- Abstracts: try to fetch abstracts

  `python cli/main.py abstracts [--topic TOPIC] [--mailto you@example.com] [--limit N] [--rps 1.0]`

  - Order: 1) fill arXiv/cond-mat abstracts from `summary` (no threshold); 2) above-threshold: Crossref (DOI, then title); 3) above-threshold: Semantic Scholar → OpenAlex → PubMed.
  - Threshold: topic `abstract_fetch.rank_threshold`, else global `defaults.rank_threshold`.
  - Only topics with `abstract_fetch.enabled: true` are processed.
  - Writes to both `papers.db.entries.abstract` and `matched_entries_history.db.matched_entries.abstract`.
  - Rate limiting: descriptive User-Agent (includes `--mailto` or `$MAILTO`), respects Retry-After; default ~1 req/sec via `--rps`.
  - Populates the `entries.abstract` column; leaves other fields unchanged.
  - Contact email: if `--mailto` is not provided, the command reads `$MAILTO` from the environment; if unset, it uses a safe default.
- Summarize (optional)

  `python cli/main.py summarize [--topic TOPIC] [--rps 0.5]`

  - Selects top entries per topic based on `llm_summary.score_cutoff` and `llm_summary.top_n`, builds input strictly from `title + abstract` (skips entries without an abstract), and calls the configured OpenAI chat model.
  - Writes summaries to `papers.db.entries.llm_summary` and, when present, `matched_entries_history.db.matched_entries.llm_summary`.
  - Uses `config.llm` settings. Supports JSON or plain-text responses; JSON is preferred and rendered with headings.
  - Note: this command only updates databases. Use `html` to render pages.
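The selection rule (score cutoff, top-N cap, abstract required, input built from title + abstract) can be sketched as follows; the function names are invented for illustration.

```python
def select_for_summary(entries, score_cutoff, top_n):
    """Pick entries for LLM summarization: at or above score_cutoff,
    skipping anything without an abstract, highest score first, capped
    at top_n."""
    eligible = [e for e in entries
                if e.get("abstract") and e.get("rank_score", 0.0) >= score_cutoff]
    eligible.sort(key=lambda e: e["rank_score"], reverse=True)
    return eligible[:top_n]

def build_input(entry):
    """Model input is built strictly from title + abstract."""
    return f"{entry['title']}\n\n{entry['abstract']}"

entries = [
    {"title": "A", "abstract": "...", "rank_score": 0.9},
    {"title": "B", "abstract": None, "rank_score": 0.8},   # skipped: no abstract
    {"title": "C", "abstract": "...", "rank_score": 0.4},  # skipped: below cutoff
]
chosen = select_for_summary(entries, score_cutoff=0.5, top_n=5)
```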
- pqa_summary (PDF-based summarization)

  `python cli/main.py pqa_summary [--topic TOPIC]`

  - Selects arXiv preprints in `papers.db` with `rank_score >= config.paperqa.download_rank_threshold`, detects arXiv IDs, and downloads PDFs (polite arXiv API usage).
  - Runs paper-qa to summarize the full text into the JSON keys `summary`, `topical_relevance`, `methods`, `novelty_impact`.
  - Writes summaries to `papers.db.entries.paper_qa_summary` only for the specific topic row the item was selected under (no longer cross-updating all topics for the same entry id), and to `matched_entries_history.db.matched_entries.paper_qa_summary`.
  - Note: this command only updates databases. Use `html` to render pages.
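arXiv ID detection can be as simple as a regex over an entry's link. This sketch covers only new-style IDs (`YYMM.NNNNN`); the real detector may recognize more formats.

```python
import re

# Matches arxiv.org/abs/..., arxiv.org/pdf/..., or an "arXiv:" prefix.
ARXIV_ID = re.compile(r"(?:arxiv\.org/(?:abs|pdf)/|arxiv:)(\d{4}\.\d{4,5})(?:v\d+)?",
                      re.IGNORECASE)

def detect_arxiv_id(entry):
    """Pull a new-style arXiv identifier out of an entry's link or id
    field; return None if nothing matches."""
    for field in ("link", "id"):
        m = ARXIV_ID.search(entry.get(field) or "")
        if m:
            return m.group(1)
    return None

aid = detect_arxiv_id({"link": "https://arxiv.org/abs/2401.12345v2"})
```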
- HTML (render only; no fetching)

  `python cli/main.py html [--topic TOPIC]`

  - Reads from `papers.db` and generates, per topic:
    - Filtered page: `output.filename` (if configured)
    - Ranked page: `output.filename_ranked` (if configured and entries exist)
    - Summary page: `output.filename_summary` (if configured). Content priority: PDF summaries → LLM summaries → abstract-only fallback; always ordered by rank.
- When `output.filename_summary` is set for a topic, summary pages prefer content in this order:
  1. `paper_qa_summary` (PDF-based)
  2. `llm_summary`
  3. Fallback to ranked fields (abstract → summary)
- Entries are ordered by descending `rank_score`.
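That priority order amounts to a first-non-empty-field lookup over the columns named above, e.g.:

```python
def summary_content(entry):
    """Choose what to show on the summary page, in the stated priority:
    paper_qa_summary -> llm_summary -> abstract -> summary."""
    for field in ("paper_qa_summary", "llm_summary", "abstract", "summary"):
        if entry.get(field):
            return field, entry[field]
    return None, ""

field, _ = summary_content({"llm_summary": "LLM text", "abstract": "Abstract text"})
```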
- Purge

  `python cli/main.py purge --days N` removes entries with `published_date` within the most recent N days from the seen-entries DB. `python cli/main.py purge --all` deletes all DB files and reinitializes schemas (no confirmation prompt).
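A sketch of the date-window purge against an in-memory copy of the seen-entries table. The abbreviated table layout and ISO date strings are assumptions; only the "most recent N days" semantics come from this section.

```python
import sqlite3
from datetime import date, timedelta

def purge_recent(conn, days, today=None):
    """Delete rows whose published_date falls within the most recent
    N days (up to and including today)."""
    today = today or date.today()
    cutoff = (today - timedelta(days=days)).isoformat()
    # ISO dates compare correctly as strings.
    cur = conn.execute("DELETE FROM feed_entries WHERE published_date > ?",
                       (cutoff,))
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed_entries (entry_id TEXT PRIMARY KEY, published_date TEXT)")
conn.executemany("INSERT INTO feed_entries VALUES (?, ?)",
                 [("a", "2024-06-10"), ("b", "2024-06-01")])
removed = purge_recent(conn, days=7, today=date(2024, 6, 12))
```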
- Status

  `python cli/main.py status`

  - Validates the config, lists topics/feeds, and shows DB paths.
- Email (Mailman digest)

  `python cli/main.py email [--topic TOPIC] [--mode auto|ranked] [--limit N] [--dry-run]`

  - Builds an email-friendly HTML digest from `papers.db` and sends it via SMTP (SSL).
  - `--mode auto` renders a ranked-style list directly from `papers.db`; `--mode ranked` embeds the pre-generated ranked HTML if present, otherwise falls back to the ranked-style list.
  - `--dry-run` writes a preview HTML to `assets/` instead of sending.
  - Per-recipient routing: `python cli/main.py email --recipients config/secrets/mailing_lists.yaml` or set `config.email.recipients_file`.
Add an `email` section in `config/config.yaml` (secrets in a separate file):

```yaml
email:
  to: "LIST_ADDRESS@yourdomain"      # Mailman list address
  subject_prefix: "Paper Firehose"   # Optional
  from: "_mainaccount@nemeslab.com"  # Defaults to smtp.username
  smtp:
    host: "mail.nemeslab.com"
    port: 465
    username: "_mainaccount@nemeslab.com"
    password_file: "config/secrets/email_password.txt"  # store only the password here
```
Notes

- Store the SMTP password in `config/secrets/email_password.txt` (gitignored).
- The command is best run after `filter`, `rank`, `abstracts`, and `summarize` so it can include LLM summaries when available.
Per-recipient YAML (`config/secrets/mailing_lists.yaml`) example:

```yaml
recipients:
  - to: "materials-list@nemeslab.com"
    topics: ["primary", "perovskites"]
    min_rank_score: 0.40
  - to: "2d-list@nemeslab.com"
    topics: ["rg", "2d_metals"]
    min_rank_score: 0.35
```
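Applying one recipient's routing rules is then a simple filter over the working set. A sketch, assuming entries carry `topic` and `rank_score` as in `papers.db`; the function name is invented.

```python
def digest_for_recipient(recipient, entries):
    """Filter entries down to one recipient's topics and
    min_rank_score, per the routing YAML above."""
    topics = set(recipient.get("topics", []))
    cutoff = recipient.get("min_rank_score", 0.0)
    return [e for e in entries
            if e["topic"] in topics and e.get("rank_score", 0.0) >= cutoff]

recipient = {"to": "materials-list@nemeslab.com",
             "topics": ["primary", "perovskites"], "min_rank_score": 0.40}
entries = [
    {"topic": "primary", "rank_score": 0.55, "title": "Kept"},
    {"topic": "primary", "rank_score": 0.20, "title": "Below cutoff"},
    {"topic": "rg", "rank_score": 0.90, "title": "Wrong topic"},
]
digest = digest_for_recipient(recipient, entries)
```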
You can call the main steps programmatically via `paper_firehose`.

Basics

`import paper_firehose as pf`

- All functions default to `config/config.yaml`; override with `config_path="..."`.
Functions

- `pf.filter(topic=None, config_path=None)`: Runs the filter step for one topic or all.
- `pf.rank(topic=None, config_path=None)`: Computes and writes `rank_score` for entries.
- `pf.abstracts(topic=None, *, mailto=None, limit=None, rps=None, config_path=None)`: Fetches abstracts for above-threshold entries and writes them to the DBs.
- `pf.summarize(topic=None, *, rps=None, config_path=None)`: Runs LLM summarization for top-ranked entries per topic, writing JSON (or text) to `llm_summary`. Generates summary HTML if configured.
- `pf.generate_html(topic, output_path=None, config_path=None)`: Regenerates the filtered-list HTML directly from `papers.db` for the given topic (uses the topic description, and defaults to the topic's configured `output.filename` if `output_path` is omitted).
- `pf.purge(days=None, all_data=False, config_path=None)`: Purges entries based on publication date. When `days` is provided, removes entries from the most recent N days (including today) across all DBs; when `all_data=True`, reinitializes all DBs.
- `pf.status(config_path=None) -> dict`: Returns a dict with config validity, topics, feed count, and DB paths.
- `history_viewer.html` is a static browser viewer for `assets/matched_entries_history.db` (table `matched_entries`).
- By default it auto-loads the latest history DB from GitHub:
  - Displayed: `https://github.com/zrbyte/paper-firehose/tree/data/assets/matched_entries_history.latest.db`
  - The viewer automatically normalizes GitHub page links to their raw content (e.g., `raw.githubusercontent.com`) before fetching.
- You can override this with a query param or a local file:
  - `history_viewer.html?db=<url>` to load a specific remote DB
  - Use the file input or drag-and-drop a local `matched_entries_history.db`
- `history_viewer_cards.html` provides a cleaner, card-style view of history entries with just the key fields (title, authors, feed name, abstract, matched date). It supports the same controls and query params as `history_viewer.html` (topic, order, search, `?db=<url>`, and file drag-and-drop) but focuses on readability instead of tabular data.
- Three-DB architecture:
  - `assets/all_feed_entries.db`: Every fetched item (for deduplication).
  - `assets/matched_entries_history.db`: All matched items across topics and runs (historical archive).
  - `assets/papers.db`: Current-run working set (filtered → ranked → summarized).
- YAML-driven configuration for feeds and topics.
- HTML is generated from `papers.db`, so you can re-render without refetching.
- Optional LLM summarization writes JSON summaries to the DB and renders dedicated summary pages.
- `all_feed_entries.db` (table `feed_entries`)
  - Keys: `entry_id` (pk), `feed_name` (display name from `config.yaml`), `title`, `link`.
  - Metadata: `summary`, `authors`, `published_date`, `first_seen`, `last_seen`, `raw_data` (JSON).
  - Used only for dedup; populated after filtering completes.
- `matched_entries_history.db` (table `matched_entries`)
  - Keys: `entry_id` (pk), `feed_name`, `topics` (CSV of topic names).
  - Metadata: `title`, `link`, `summary`, `authors`, `abstract` (nullable), `doi` (nullable), `published_date`, `matched_date`, `raw_data` (JSON), `llm_summary` (nullable), `paper_qa_summary` (nullable).
  - Written only when a topic's `output.archive: true`.
- `papers.db` (table `entries`)
  - Primary key: composite `PRIMARY KEY(id, topic)` so the same entry can appear once per topic.
  - Columns: `id`, `topic`, `feed_name` (display name), `title`, `link`, `summary`, `authors`, `abstract` (nullable), `doi` (nullable), `published_date`, `discovered_date`, `status` (`filtered|ranked|summarized`), `rank_score`, `rank_reasoning`, `llm_summary`, `raw_data` (JSON).
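The composite primary key means the same `id` can be inserted once per topic. A minimal sketch with an abbreviated column list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entries (
        id TEXT, topic TEXT, title TEXT,
        rank_score REAL DEFAULT 0.0,
        PRIMARY KEY (id, topic)
    )
""")
conn.executemany("INSERT INTO entries (id, topic, title) VALUES (?, ?, ?)", [
    ("e1", "primary", "Same paper"),
    ("e1", "perovskites", "Same paper"),  # same id, different topic: allowed
])
rows = conn.execute("SELECT COUNT(*) FROM entries WHERE id = 'e1'").fetchone()[0]
```

Re-inserting the same `(id, topic)` pair would raise an `IntegrityError`, which is what keeps the working set to one row per entry per topic.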
Notes

- `feed_name` is the human-readable name from `config.yaml -> feeds.<key>.name` (e.g., "Nature Physics").
- `doi` is best-effort, extracted from the RSS feed; it can be found in `doi`, `dc:identifier`, `prism:doi`, `id`, `link`, `summary`, `summary_detail.value`, or embedded `content[].value`. arXiv feeds may not include DOIs; no external lookup is performed (by design for now).
- `abstract` is populated via the Crossref API or publisher APIs.
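A best-effort extraction like this can use a single DOI regex over the fields listed above. The pattern and the field subset here are illustrative, not the exact implementation.

```python
import re

# Common DOI shape: "10." + 4-9 registrant digits + "/" + suffix.
DOI_RX = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

def find_doi(entry):
    """Scan candidate fields in order; return the first DOI-looking
    substring, or None."""
    for field in ("doi", "dc:identifier", "prism:doi", "id", "link", "summary"):
        m = DOI_RX.search(entry.get(field) or "")
        if m:
            return m.group(0)
    return None

doi = find_doi({"link": "https://doi.org/10.1038/s41567-024-00001-1"})
```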
- `config/config.yaml` (feeds, DB paths, defaults)
  - Each feed has a key and a display `name`; the key is used in topic files, and the name is stored in the DBs.
  - `paperqa`: settings for the arXiv downloader (Phase 1)
    - `download_rank_threshold`: minimum `rank_score` to download (default 0.35)
    - `rps`: requests/second throttle (default 0.3; ~1 request per 3.3 s, per arXiv API guidance)
    - `max_retries`: per-item retry attempts on transient errors
    - `prompt`: paper-qa question used for summarization; should instruct the model to return only JSON with the keys `summary`, `topical_relevance`, `methods`, `novelty_impact` (supports a `{ranking_query}` placeholder)
- `config/topics/<topic>.yaml`
  - `feeds`: list of feed keys from `config.yaml`.
  - `filter.pattern` and `filter.fields`: regex and fields to match (defaults include `title` and `summary`).
  - `ranking`: optional `query`, `model`, cutoffs, etc. (for the rank command).
    - Optional: `negative_queries` (list), `preferred_authors` (list of names), `priority_author_boost` (float, e.g., 0.1).
  - `output.filename` and `output.filename_ranked`: HTML output; `archive: true` enables history DB writes.
  - `llm_summary`: topic-level controls for LLM summarization.
    - `enabled: true|false`
    - `prompt`: instruction given to the model. You can reference `{ranking_query}` and it will be replaced with the topic's `ranking.query`.
    - `score_cutoff`: minimum `rank_score` to consider (0.0–1.0)
    - `top_n`: hard cap on the number of entries considered (after filtering by score)
    - Works together with the global `config.llm` settings below.
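Putting those keys together, a topic file might look like the following hypothetical example (all values are invented for illustration; see the existing files under `config/topics/` for real ones):

```yaml
# config/topics/example_topic.yaml (hypothetical)
feeds: [nature_physics, cond_mat]
filter:
  pattern: "graphene|bilayer"
  fields: [title, summary]
ranking:
  query: "moire superlattices in twisted bilayer graphene"
  negative_queries: ["review"]
  preferred_authors: ["A. Friend"]
  priority_author_boost: 0.1
output:
  filename: example_topic.html
  filename_ranked: example_topic_ranked.html
  archive: true
llm_summary:
  enabled: true
  prompt: "Summarize the paper's relevance to: {ranking_query}"
  score_cutoff: 0.5
  top_n: 10
```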
- `config.llm` (global model settings)
  - `model`: preferred chat model id
  - `model_fallback`: secondary model if the primary is unsupported/unavailable
  - `api_key_env`: environment variable name to read if `openaikulcs.env` is missing
  - `rps`: default requests/second throttle for summarization
  - `max_retries`: retry attempts per item on transient errors
  - Optional GPT-5 parameters: `verbosity`, `reasoning_effort` (used when the model name starts with `gpt-5`)
Thank you to arXiv for use of its open access interoperability.