
fix: enables detection of pdf in URLs, and parsing of pdf content via URL #30

Open
wants to merge 1 commit into
base: main
from

Conversation

lzamparo

Closes #25

Modifies ingest.py to dispatch to PDFParser based on the retrieved content header, and modifies PDFParser to retrieve the content again and parse it from a tempfile. This isn't super clean, but it should be a good starting point for caching & parsing PDFs, as the issue identified.
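For readers without the diff open, here is a minimal sketch of the two steps described above: deciding from the response header whether a URL points at a PDF, and writing the downloaded bytes to a tempfile so a path-based parser can read them. This is not the code in this PR. The function names `looks_like_pdf` and `save_to_tempfile` are hypothetical, and the fallbacks to the `%PDF-` magic bytes and the `.pdf` path extension are assumptions added for illustration.

```python
import os
import tempfile
from urllib.parse import urlparse


def looks_like_pdf(url: str, content_type: str = "", head: bytes = b"") -> bool:
    """Guess whether a fetched URL is a PDF (hypothetical helper)."""
    # The Content-Type header may carry parameters, e.g.
    # "application/pdf; charset=binary", so compare only the media type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Assumption: some servers mislabel PDFs, so also check the magic bytes.
    if head.startswith(b"%PDF-"):
        return True
    # Last resort: the URL path extension, ignoring any query string.
    return urlparse(url).path.lower().endswith(".pdf")


def save_to_tempfile(content: bytes, suffix: str = ".pdf") -> str:
    """Write downloaded bytes to a tempfile and return its path.

    delete=False keeps the file on disk after the handle closes, so a
    parser that expects a file path can open it. The caller removes it.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(content)
        return tmp.name
```

A caller would fetch the URL once, pass the `Content-Type` header and the first few bytes to `looks_like_pdf`, and hand the path from `save_to_tempfile` to the PDF parser, calling `os.remove(path)` afterwards. Reusing the already-fetched bytes this way would also avoid the second retrieval mentioned above.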

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 21, 2025
@lzamparo
Author

lzamparo commented Jun 3, 2025

@init27 any chance you could have a quick look at this? I don't want it to get too stale.

Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue on ingest PDF
2 participants