Open
Labels
🐞 Bug: Something isn't working
🩺 Needs Triage: Needs attention of maintainers
Description
crawl4ai version
0.6.3
Expected Behavior
My example crawler:
llm_strategy = LLMExtractionStrategy(
    llm_config=self.llm_config,
    schema=PdfDoc.model_json_schema(),
    extraction_type="schema",
    instruction="""
From the crawled content, extract data from html
- data in html source include pdf file name and href url under data-pdf style attribute
- One extracted post JSON format should look like this:
{
"name": "Volume 1 - 2024 CMD Rate Case - E-filing.pdf",
"data_pdf": "/DMS/pdfview/``wportal`Documents`DMS`31`2494`/Volume 1 - 2024 CMD Rate Case - E-filing~pdf",
}
""",
    input_format="cleaned_html",
)
run_conf = CrawlerRunConfig(
    extraction_strategy=llm_strategy,
    cache_mode=CacheMode.DISABLED,
    target_elements=[
        ".btnOpenPdfFile",
        ".btn",
        ".btn-default",
        ".btn-sm",
    ],
    excluded_tags=[
        "header",
        "footer",
        "nav",
        "meta",
        "script",
        "style",
        "iframe",
        "li",
        "ul",
    ],
    prettiify=True,
)
resp = requests.get(url)
async with AsyncWebCrawler(config=self.browser_conf) as crawler:
    result = await crawler.arun(url=f"raw://{resp.text}", config=run_conf)
    assert isinstance(result, CrawlResultContainer)
Expected: only the cleaned_html content should be sent to the LLM server.
Current Behavior
The LLM extraction strategy sends both the URL and the HTML content to the LLM server. I think the case where the URL is raw HTML input (raw://...) is not handled. As a result, the full raw HTML embedded in the URL is always sent to the LLM server alongside the cleaned content. This is likely unintended behavior, and it can waste token quota.
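To illustrate the duplication, here is a minimal, library-independent sketch. It assumes the prompt is built from a URL field plus a content field; the template, `prompt_safe_url`, and `build_prompt` are hypothetical names used only for illustration and are not crawl4ai APIs. The idea is to replace a raw: URL with a short placeholder before it reaches the prompt.

```python
# Hypothetical sketch: not crawl4ai code. It only shows why a raw:// URL
# doubles the payload and one possible way to avoid that.
RAW_PREFIXES = ("raw://", "raw:")


def prompt_safe_url(url: str, placeholder: str = "Raw HTML") -> str:
    """Return a short label instead of the URL when the URL embeds raw HTML."""
    if url.startswith(RAW_PREFIXES):
        return placeholder
    return url


def build_prompt(url: str, html: str) -> str:
    """Assumed prompt layout: a URL line followed by the content to extract from."""
    return f"URL: {prompt_safe_url(url)}\nCONTENT:\n{html}"


html = "<a class='btnOpenPdfFile' data-pdf='/DMS/pdfview/a~pdf'>a.pdf</a>"

# Naive prompt: the raw HTML appears once in the URL and once as content.
naive = f"URL: raw://{html}\nCONTENT:\n{html}"

# Guarded prompt: the raw HTML appears only as content.
safe = build_prompt(f"raw://{html}", html)

print(naive.count(html), safe.count(html))
```

Regular URLs pass through unchanged, so a guard like this would only affect raw HTML input.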
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
Linux
Python version
3.12.3
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response