fix: enables detection of pdf in URLs, and parsing of pdf content via URL #30

lzamparo · 2025-05-21T04:29:30Z

Closes #25

Modifies ingest.py to dispatch to PDFParser based on the header of the retrieved content, and modifies PDFParser to retrieve the content again and parse it from a tempfile. This isn't super clean, but it should be a good starting point for caching and parsing PDFs, as identified in the issue.
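For readers skimming the thread, here is a minimal sketch of the idea described above. It is not the code in this PR. It assumes the header in question is Content-Type, and the names `looks_like_pdf` and `parse_via_tempfile` are hypothetical. Since PDFParser's interface isn't shown here, the parsing step is passed in as a plain callable that takes a file path.

```python
import os
import tempfile
from typing import Callable, Optional
from urllib.parse import urlparse


def looks_like_pdf(content_type: Optional[str], url: str, first_bytes: bytes = b"") -> bool:
    """Heuristic PDF check (hypothetical helper, not part of the PR)."""
    # Drop parameters such as "; charset=..." and normalize case.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime == "application/pdf":
        return True
    # PDF files begin with the marker "%PDF-".
    if first_bytes.startswith(b"%PDF-"):
        return True
    # Fall back to the URL suffix only when the header is missing or generic.
    if mime in ("", "application/octet-stream"):
        return urlparse(url).path.lower().endswith(".pdf")
    return False


def parse_via_tempfile(data: bytes, parse_file: Callable[[str], str]) -> str:
    """Write bytes to a temporary .pdf file, run a path-based parser, clean up."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        # parse_file stands in for whatever path-based parser is used.
        return parse_file(path)
    finally:
        os.remove(path)
```

Taking the downloaded bytes as an argument, rather than fetching inside the helper, would also let a caller reuse one download for both detection and parsing, which may help with the caching mentioned above.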

lzamparo · 2025-06-03T15:01:19Z

@init27 any chance you could have a quick look at this? Don't want it to get too stale.

enables detection of pdf in URLs, and parsing of pdf content via URL

dbd709b

facebook-github-bot added the CLA Signed label May 21, 2025
