
Commit f6009d1

fix: better playwright installation handling
1 parent e374e05 commit f6009d1

6 files changed (+204, -143 lines)

README.md

Lines changed: 47 additions & 40 deletions
@@ -24,21 +24,6 @@ Just say which information you want to extract and the library will do it for you!
   <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
 </p>
 
-## 🔗 ScrapeGraph API & SDKs
-If you are looking for a quick solution to integrate ScrapeGraph in your system, check out our powerful API [here!](https://dashboard.scrapegraphai.com/login)
-
-<p align="center">
-  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 100%;">
-</p>
-
-We offer SDKs in both Python and Node.js, making it easy to integrate into your projects. Check them out below:
-
-| SDK | Language | GitHub Link |
-|-----------|----------|-----------------------------------------------------------------------------|
-| Python SDK | Python | [scrapegraph-py](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py) |
-| Node.js SDK | Node.js | [scrapegraph-js](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-js) |
-
-The Official API Documentation can be found [here](https://docs.scrapegraphai.com/).
-
 
 ## 🚀 Quick install
 
@@ -47,6 +32,7 @@ The reference page for Scrapegraph-ai is available on the official page of PyPI:
 ```bash
 pip install scrapegraphai
 
+# IMPORTANT (to fetch webpage content)
 playwright install
 ```
 
@@ -84,13 +70,12 @@ The most common one is the `SmartScraperGraph`, which extracts information from
 
 
 ```python
-import json
 from scrapegraphai.graphs import SmartScraperGraph
 
 # Define the configuration for the scraping pipeline
 graph_config = {
     "llm": {
-        "api_key": "YOUR_OPENAI_APIKEY",
+        "api_key": "YOUR_OPENAI_API_KEY",
         "model": "openai/gpt-4o-mini",
     },
     "verbose": True,
@@ -99,33 +84,45 @@ graph_config = {
 
 # Create the SmartScraperGraph instance
 smart_scraper_graph = SmartScraperGraph(
-    prompt="Extract me all the news from the website",
-    source="https://www.wired.com",
+    prompt="Extract useful information from the webpage, including a description of what the company does, founders and social media links",
+    source="https://scrapegraphai.com/",
     config=graph_config
 )
 
 # Run the pipeline
 result = smart_scraper_graph.run()
+
+import json
 print(json.dumps(result, indent=4))
 ```
 
 The output will be a dictionary like the following:
 
 ```python
-"result": {
-    "news": [
-        {
-            "title": "The New Jersey Drone Mystery May Not Actually Be That Mysterious",
-            "link": "https://www.wired.com/story/new-jersey-drone-mystery-maybe-not-drones/",
-            "author": "Lily Hay Newman"
-        },
-        {
-            "title": "Former ByteDance Intern Accused of Sabotage Among Winners of Prestigious AI Award",
-            "link": "https://www.wired.com/story/bytedance-intern-best-paper-neurips/",
-            "author": "Louise Matsakis"
-        },
-        ...
-    ]
+{
+    "description": "ScrapeGraphAI transforms websites into clean, organized data for AI agents and data analytics. It offers an AI-powered API for effortless and cost-effective data extraction.",
+    "founders": [
+        {
+            "name": "Marco Perini",
+            "role": "Founder & Technical Lead",
+            "linkedin": "https://www.linkedin.com/in/perinim/"
+        },
+        {
+            "name": "Marco Vinciguerra",
+            "role": "Founder & Software Engineer",
+            "linkedin": "https://www.linkedin.com/in/marco-vinciguerra-7ba365242/"
+        },
+        {
+            "name": "Lorenzo Padoan",
+            "role": "Founder & Product Engineer",
+            "linkedin": "https://www.linkedin.com/in/lorenzo-padoan-4521a2154/"
+        }
+    ],
+    "social_media_links": {
+        "linkedin": "https://www.linkedin.com/company/101881123",
+        "twitter": "https://x.com/scrapegraphai",
+        "github": "https://github.com/ScrapeGraphAI/Scrapegraph-ai"
+    }
 }
 ```
 There are other pipelines that can be used to extract information from multiple pages, generate Python scripts, or even generate audio files.
@@ -145,20 +142,30 @@ It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**,
 
 Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command, if you want to use local models.
 
-## 🔍 Demo
-Official streamlit demo:
-
-[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-demo-demo.streamlit.app)
 
-Try it directly on the web using Google Colab:
+## 📖 Documentation
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
 
-## 📖 Documentation
-
 The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
 Check out also the Docusaurus [here](https://docs-oss.scrapegraphai.com/).
 
+## 🔗 ScrapeGraph API & SDKs
+If you are looking for a quick solution to integrate ScrapeGraph in your system, check out our powerful API [here!](https://dashboard.scrapegraphai.com/login)
+
+<p align="center">
+  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 100%;">
+</p>
+
+We offer SDKs in both Python and Node.js, making it easy to integrate into your projects. Check them out below:
+
+| SDK | Language | GitHub Link |
+|-----------|----------|-----------------------------------------------------------------------------|
+| Python SDK | Python | [scrapegraph-py](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py) |
+| Node.js SDK | Node.js | [scrapegraph-js](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-js) |
+
+The Official API Documentation can be found [here](https://docs.scrapegraphai.com/).
+
 ## 🏆 Sponsors
 <div style="text-align: center;">
   <a href="https://2ly.link/1zaXG">

pyproject.toml

Lines changed: 0 additions & 4 deletions
@@ -32,13 +32,9 @@ dependencies = [
     "fastembed>=0.3.6",
     "semchunk>=2.2.0",
     "transformers>=4.44.2",
-    "transformers>=4.44.2",
     "googlesearch-python>=1.2.5",
     "async-timeout>=4.0.3",
-    "transformers>=4.44.2",
-    "googlesearch-python>=1.2.5",
     "simpleeval>=1.0.0",
-    "async_timeout>=4.0.3",
     "scrapegraph-py>=1.7.0"
 ]

scrapegraphai/docloaders/chromium.py

Lines changed: 39 additions & 37 deletions
@@ -23,9 +23,6 @@ class ChromiumLoader(BaseLoader):
         requires_js_support: Flag to determine if JS rendering is required.
     """
 
-    RETRY_LIMIT = 3
-    TIMEOUT = 10
-
     def __init__(
         self,
         urls: List[str],
@@ -37,6 +34,8 @@ def __init__(
         requires_js_support: bool = False,
         storage_state: Optional[str] = None,
         browser_name: str = "chromium", #default chromium
+        retry_limit: int = 1,
+        timeout: int = 10,
         **kwargs: Any,
     ):
         """Initialize the loader with a list of URL paths.
@@ -47,6 +46,8 @@ def __init__(
            proxy: A dictionary containing proxy information; None disables protection.
            urls: A list of URLs to scrape content from.
            requires_js_support: Whether to use JS rendering for scraping.
+           retry_limit: Maximum number of retry attempts for scraping. Defaults to 3.
+           timeout: Maximum time in seconds to wait for scraping. Defaults to 10.
            kwargs: A dictionary containing additional browser kwargs.
 
        Raises:
@@ -68,12 +69,17 @@ def __init__(
         self.requires_js_support = requires_js_support
         self.storage_state = storage_state
         self.browser_name = browser_name
+        self.retry_limit = retry_limit
+        self.timeout = timeout
 
     async def scrape(self, url:str) -> str:
         if self.backend == "playwright":
             return await self.ascrape_playwright(url)
         elif self.backend == "selenium":
-            return await self.ascrape_undetected_chromedriver(url)
+            try:
+                return await self.ascrape_undetected_chromedriver(url)
+            except Exception as e:
+                raise ValueError(f"Failed to scrape with undetected chromedriver: {e}")
         else:
             raise ValueError(f"Unsupported backend: {self.backend}")
 
@@ -97,9 +103,9 @@ async def ascrape_undetected_chromedriver(self, url: str) -> str:
         results = ""
         attempt = 0
 
-        while attempt < self.RETRY_LIMIT:
+        while attempt < self.retry_limit:
             try:
-                async with async_timeout.timeout(self.TIMEOUT):
+                async with async_timeout.timeout(self.timeout):
                     # Handling browser selection
                     if self.backend == "selenium":
                         if self.browser_name == "chromium":
@@ -134,9 +140,9 @@ async def ascrape_undetected_chromedriver(self, url: str) -> str:
             except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                 attempt += 1
                 logger.error(f"Attempt {attempt} failed: {e}")
-                if attempt == self.RETRY_LIMIT:
+                if attempt == self.retry_limit:
                     results = (
-                        f"Error: Network error after {self.RETRY_LIMIT} attempts - {e}"
+                        f"Error: Network error after {self.retry_limit} attempts - {e}"
                     )
             finally:
                 driver.quit()
@@ -204,7 +210,7 @@ async def ascrape_playwright_scroll(
         results = ""
         attempt = 0
 
-        while attempt < self.RETRY_LIMIT:
+        while attempt < self.retry_limit:
             try:
                 async with async_playwright() as p:
                     browser = None
@@ -268,8 +274,8 @@ async def ascrape_playwright_scroll(
             except (aiohttp.ClientError, asyncio.TimeoutError, Exception) as e:
                 attempt += 1
                 logger.error(f"Attempt {attempt} failed: {e}")
-                if attempt == self.RETRY_LIMIT:
-                    results = f"Error: Network error after {self.RETRY_LIMIT} attempts - {e}"
+                if attempt == self.retry_limit:
+                    results = f"Error: Network error after {self.retry_limit} attempts - {e}"
             finally:
                 await browser.close()
 
@@ -283,7 +289,11 @@ async def ascrape_playwright(self, url: str, browser_name: str = "chromium") -> str:
             url (str): The URL to scrape.
 
         Returns:
-            str: The scraped HTML content or an error message if an exception occurs.
+            str: The scraped HTML content
+
+        Raises:
+            RuntimeError: When retry limit is reached without successful scraping
+            ValueError: When an invalid browser name is provided
         """
         from playwright.async_api import async_playwright
         from undetected_playwright import Malenia
@@ -292,9 +302,9 @@ async def ascrape_playwright(self, url: str, browser_name: str = "chromium") -> str:
         results = ""
         attempt = 0
 
-        while attempt < self.RETRY_LIMIT:
+        while attempt < self.retry_limit:
             try:
-                async with async_playwright() as p, async_timeout.timeout(self.TIMEOUT):
+                async with async_playwright() as p, async_timeout.timeout(self.timeout):
                     browser = None
                     if browser_name == "chromium":
                         browser = await p.chromium.launch(
@@ -315,41 +325,37 @@ async def ascrape_playwright(self, url: str, browser_name: str = "chromium") -> str:
                     await page.wait_for_load_state(self.load_state)
                     results = await page.content()
                     logger.info("Content scraped")
-                    break
+                    return results
             except (aiohttp.ClientError, asyncio.TimeoutError, Exception) as e:
                 attempt += 1
                 logger.error(f"Attempt {attempt} failed: {e}")
-                if attempt == self.RETRY_LIMIT:
-                    results = f"Error: Network error after {self.RETRY_LIMIT} attempts - {e}"
+                if attempt == self.retry_limit:
+                    raise RuntimeError(f"Failed to scrape after {self.retry_limit} attempts: {str(e)}")
             finally:
-                if "browser" in locals():
-                    await browser.close()
-
-
-        return results
-
-
+                await browser.close()
 
-    async def ascrape_with_js_support(self, url: str , browser_name:str = "chromium") -> str:
+    async def ascrape_with_js_support(self, url: str, browser_name: str = "chromium") -> str:
         """
         Asynchronously scrape the content of a given URL by rendering JavaScript using Playwright.
 
         Args:
             url (str): The URL to scrape.
 
         Returns:
-            str: The fully rendered HTML content after JavaScript execution,
-            or an error message if an exception occurs.
+            str: The fully rendered HTML content after JavaScript execution
+
+        Raises:
+            RuntimeError: When retry limit is reached without successful scraping
+            ValueError: When an invalid browser name is provided
         """
         from playwright.async_api import async_playwright
 
         logger.info(f"Starting scraping with JavaScript support for {url}...")
-        results = ""
         attempt = 0
 
-        while attempt < self.RETRY_LIMIT:
+        while attempt < self.retry_limit:
             try:
-                async with async_playwright() as p, async_timeout.timeout(self.TIMEOUT):
+                async with async_playwright() as p, async_timeout.timeout(self.timeout):
                     browser = None
                     if browser_name == "chromium":
                         browser = await p.chromium.launch(
@@ -368,19 +374,15 @@ async def ascrape_with_js_support(self, url: str , browser_name:str = "chromium") -> str:
                     await page.goto(url, wait_until="networkidle")
                     results = await page.content()
                     logger.info("Content scraped after JavaScript rendering")
-                    break
+                    return results
             except (aiohttp.ClientError, asyncio.TimeoutError, Exception) as e:
                 attempt += 1
                 logger.error(f"Attempt {attempt} failed: {e}")
-                if attempt == self.RETRY_LIMIT:
-                    results = (
-                        f"Error: Network error after {self.RETRY_LIMIT} attempts - {e}"
-                    )
+                if attempt == self.retry_limit:
+                    raise RuntimeError(f"Failed to scrape after {self.retry_limit} attempts: {str(e)}")
             finally:
                 await browser.close()
 
-        return results
-
     def lazy_load(self) -> Iterator[Document]:
         """
         Lazily load text content from the provided URLs.
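With this change, retry behaviour is configured per `ChromiumLoader` instance instead of through the class-level `RETRY_LIMIT`/`TIMEOUT` constants, and the Playwright paths raise `RuntimeError` once the retries are exhausted rather than returning an error string. Below is a minimal usage sketch of the new parameters; the URL and the retry/timeout values are placeholders, and it assumes the remaining constructor arguments keep their defaults (Playwright backend).

```python
import asyncio

from scrapegraphai.docloaders.chromium import ChromiumLoader

# Hypothetical values for illustration only; not taken from this commit.
loader = ChromiumLoader(
    urls=["https://example.com"],
    retry_limit=3,   # replaces the old class constant RETRY_LIMIT = 3
    timeout=10,      # replaces the old class constant TIMEOUT = 10
)

async def main() -> None:
    try:
        # ascrape_playwright now returns the HTML directly and raises
        # RuntimeError after retry_limit failed attempts.
        html = await loader.ascrape_playwright("https://example.com")
        print(f"Scraped {len(html)} characters")
    except RuntimeError as e:
        print(f"Scraping failed after retries: {e}")

asyncio.run(main())
```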

scrapegraphai/nodes/generate_answer_node.py

Lines changed: 1 addition & 3 deletions
@@ -12,10 +12,8 @@
 from langchain_community.chat_models import ChatOllama
 from tqdm import tqdm
 from .base_node import BaseNode
-from ..utils.output_parser import get_structured_output_parser, get_pydantic_output_parser
+from ..utils.output_parser import get_pydantic_output_parser
 from requests.exceptions import Timeout
-from langchain.callbacks.manager import CallbackManager
-from langchain.callbacks import get_openai_callback
 from ..prompts import (
     TEMPLATE_CHUNKS, TEMPLATE_NO_CHUNKS, TEMPLATE_MERGE,
     TEMPLATE_CHUNKS_MD, TEMPLATE_NO_CHUNKS_MD, TEMPLATE_MERGE_MD

scrapegraphai/utils/llm_callback_manager.py

Lines changed: 1 addition & 2 deletions
@@ -7,8 +7,7 @@
 
 import threading
 from contextlib import contextmanager
-from langchain_community.callbacks import get_openai_callback
-from langchain_community.callbacks.manager import get_bedrock_anthropic_callback
+from langchain_community.callbacks.manager import get_openai_callback, get_bedrock_anthropic_callback
 from langchain_openai import ChatOpenAI, AzureChatOpenAI
 from langchain_aws import ChatBedrock
 from .custom_callback import get_custom_callback
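Both callback helpers are exposed by `langchain_community.callbacks.manager`, which is why the two imports can collapse into one line. As a rough sketch of how the consolidated import is typically exercised (the model name and prompt are placeholders, and a configured OpenAI API key is assumed):

```python
# Illustrative sketch only; model name, prompt, and credentials are assumptions.
from langchain_community.callbacks.manager import (
    get_openai_callback,
    get_bedrock_anthropic_callback,
)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# The callback context manager accumulates token usage and cost
# for every LLM call made inside the block.
with get_openai_callback() as cb:
    llm.invoke("Hello")
    print(cb.total_tokens, cb.total_cost)
```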
