Describe the Bug
When crawling websites that contain PDF files, the crawler includes the raw PDF contents in both the HTML and markdown output fields (a quick way to spot such dumps is sketched after the list below). This causes several problems:
- Significantly longer response times when retrieving results
- Unnecessary storage consumption in the results database
- Results that are potentially harder to parse and use effectively
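
One way to confirm that a field holds a dumped PDF rather than extracted text is to look for PDF structural markers. The helper below is a minimal, hypothetical triage sketch (the function name and threshold are not part of the crawler); the `%PDF-` header and the object/stream keywords are standard PDF syntax.

```python
def looks_like_raw_pdf(text: str) -> bool:
    """Heuristic check: does a string look like a raw PDF dump?

    Real PDF files begin with a '%PDF-' header and contain object/stream
    keywords that should not appear in cleanly extracted text or markdown.
    """
    if not text:
        return False
    if text.lstrip().startswith("%PDF-"):
        return True
    # Structural keywords from PDF object syntax; two or more is a strong signal.
    markers = ("endobj", "endstream", "startxref")
    return sum(marker in text for marker in markers) >= 2
```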
To Reproduce
- Run the crawl command on a target website containing PDF files (e.g. https://becu.org/)
- Observe the returned results in both the HTML and markdown fields (e.g. via http://{HOST URL}/v1/crawl/{CRAWL ID})
- Notice that PDF contents are dumped as raw text into these fields (a scripted version of these steps is sketched below)
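
For convenience, here is a rough scripted version of the steps above. It assumes a crawl can be started with a POST to `/v1/crawl` carrying a JSON body with a `url` field, that the results endpoint returns a JSON object with `status` and `data` fields, and that each returned document exposes `markdown` and `html` keys; adjust the host, authentication, and payload to match your deployment.

```python
import time

import requests

BASE = "http://{HOST URL}"  # placeholder from the report; replace with your instance

# Assumption: a crawl is started by POSTing the target URL to /v1/crawl.
start = requests.post(f"{BASE}/v1/crawl", json={"url": "https://becu.org/"})
start.raise_for_status()
crawl_id = start.json()["id"]  # assumed field name for the returned crawl id

# Poll the results endpoint referenced in the steps above until the crawl finishes.
while True:
    resp = requests.get(f"{BASE}/v1/crawl/{crawl_id}")
    resp.raise_for_status()
    body = resp.json()
    if body.get("status") == "completed":  # assumed status value
        break
    time.sleep(5)

# Flag any document whose markdown or html field looks like a raw PDF dump.
for doc in body.get("data", []):
    for field in ("markdown", "html"):
        content = doc.get(field) or ""
        if content.lstrip().startswith("%PDF-"):
            print(f"raw PDF dumped into '{field}' ({len(content)} characters)")
```

With the bug present, the script prints a line for each affected document; once PDF handling is fixed (for example, extracting the PDF text or skipping PDFs entirely), it should print nothing.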