Extract web text directly instead of OCR

I'm working on something pretty similar to what you guys are doing and had a thought. Why not grab the text directly from the page's HTML instead of running OCR on it? LangChain and LlamaIndex both have tools for this, and there are also some repos for converting HTML to Markdown.
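To make the idea concrete, here is a rough sketch using only the Python standard library. It is not how LangChain or LlamaIndex do it, and the names `TextExtractor`, `html_to_text`, and `fetch_text` are just made up for illustration. It assumes the page is plain server-rendered HTML (no JavaScript rendering) and only maps `h1`-`h3` to Markdown-style headings.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text; h1-h3 become Markdown-style heading lines."""

    SKIP = {"script", "style", "noscript"}  # content a reader never sees
    BLOCK = {"p", "div", "li", "br", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            # Start block elements on a new line, with a heading prefix if any.
            self.parts.append("\n")
            self.parts.append(self.HEADINGS.get(tag, ""))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def fetch_text(url: str) -> str:
    """Download a page and extract its text (assumes static HTML)."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return html_to_text(resp.read().decode(charset, errors="replace"))
```

Calling `html_to_text("<h1>Title</h1><p>Hello &amp; <b>world</b></p>")` gives two lines, `# Title` and `Hello & world`, with no OCR step involved. A real HTML-to-Markdown library would handle links, lists, and tables much better than this.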

Just a thought. Would love to know what you think!