Commit 9952d98

Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-ai into pre/beta
2 parents 1e7f334 + 0145b8f commit 9952d98

38 files changed

+1099
-86
lines changed

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -23,6 +23,7 @@ docs/source/_static/
 venv/
 .venv/
 .vscode/
+.conda/
 
 # exclude pdf, mp3
 *.pdf
@@ -38,3 +39,6 @@ lib/
 *.html
 .idea
 
+# extras
+cache/
+run_smart_scraper.py

CHANGELOG.md

Lines changed: 31 additions & 0 deletions
@@ -1,3 +1,34 @@
+## [1.7.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.7.0-beta.3...v1.7.0-beta.4) (2024-06-12)
+
+
+### Bug Fixes
+
+* common params ([6b4cdf9](https://github.com/VinciGit00/Scrapegraph-ai/commit/6b4cdf92b82fa143e4217a2e5da46d04f2585de8))
+
+## [1.7.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.7.0-beta.2...v1.7.0-beta.3) (2024-06-11)
+
+
+### Features
+
+* add caching ([d790361](https://github.com/VinciGit00/Scrapegraph-ai/commit/d79036149a3197a385b73553f29df66d36480c38))
+* add dynamic caching ([7ed2fe8](https://github.com/VinciGit00/Scrapegraph-ai/commit/7ed2fe8ef0d16fd93cb2ff88840bcaa643349e33))
+* add new chunking function ([e1f045b](https://github.com/VinciGit00/Scrapegraph-ai/commit/e1f045b2809fc7db0c252f4c6f2f9a435c66ba91))
+* **merge:** add scriptcreatormulti, rag cache and semchunk ([15421ef](https://github.com/VinciGit00/Scrapegraph-ai/commit/15421eff7009b80293f7d84df5086d22944dfb99))
+* **schema:** merge scripts to follow pydantic schema ([5d692bf](https://github.com/VinciGit00/Scrapegraph-ai/commit/5d692bff9e4f124146dd37e573f7c3c0aa8d9a23))
+* refactoring of rag node ([7a13a68](https://github.com/VinciGit00/Scrapegraph-ai/commit/7a13a6819ff35a6f6197ee837d0eb8ea65e31776))
+
+
+### Bug Fixes
+
+* **cache:** correctly pass the node arguments and logging ([c881f64](https://github.com/VinciGit00/Scrapegraph-ai/commit/c881f64209a86a69ddd3105f5d0360d9ed183490))
+* **node:** fixed generate answer node pydantic schema ([ab00f23](https://github.com/VinciGit00/Scrapegraph-ai/commit/ab00f23d859c64995ccfe329b24379cf3c14d73c))
+
+
+### Docs
+
+* **cache:** added cache_path param ([edddb68](https://github.com/VinciGit00/Scrapegraph-ai/commit/edddb682d06262088885e340b7b73cc70adf9583))
+* **scriptcreator:** enhance documentation ([650c3aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/650c3aaa60dab169358c2c04bfca9dee8d1a5d68))
+
 ## [1.7.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.7.0-beta.1...v1.7.0-beta.2) (2024-06-10)
 

README.md

Lines changed: 5 additions & 2 deletions
@@ -43,11 +43,14 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
 Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
-There are three main scraping pipelines that can be used to extract information from a website (or local file):
+There are multiple standard scraping pipelines that can be used to extract information from a website (or local file):
 - `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
 - `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
 - `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
-- `SmartScraperMultiGraph`: multiple page scraper given a single prompt
+- `ScriptCreatorGraph`: single-page scraper that extracts information from a website and generates a Python script.
+
+- `SmartScraperMultiGraph`: multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources;
+- `ScriptCreatorMultiGraph`: multi-page scraper that generates a Python script for extracting information from multiple pages given a single prompt and a list of sources.
 
 It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.

docs/assets/scriptcreatorgraph.png

53.7 KB

docs/source/scrapers/graph_config.rst

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ Some interesting ones are:
 - `loader_kwargs`: A dictionary with additional parameters to be passed to the `Loader` class, such as `proxy`.
 - `burr_kwargs`: A dictionary with additional parameters to enable `Burr` graphical user interface.
 - `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
+- `cache_path`: The path where the cache files will be saved. If the path already exists, the cache will be loaded from it.
 
 .. _Burr:
 
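
As a rough illustration of the `cache_path` entry documented in this hunk, the key would presumably sit next to the other top-level configuration keys. The sketch below is not taken from the commit: the helper name `build_graph_config`, the placeholder model values, and the `cache` directory name are all assumptions, and only the `cache_path` key itself comes from the documentation line.

```python
def build_graph_config(llm_config: dict, cache_path: str = "cache") -> dict:
    """Return a graph configuration dict with a cache_path entry.

    Hypothetical helper for illustration; only the "cache_path" key
    is taken from the documentation, the rest is assumed.
    """
    return {
        # Copy the LLM settings so the caller's dict is not mutated later.
        "llm": dict(llm_config),
        # Location where cache files are saved and, if present, reloaded from.
        "cache_path": cache_path,
    }


# Placeholder values; replace with a real model name and API key.
config = build_graph_config({"model": "example-model", "api_key": "YOUR_API_KEY"})
print(config["cache_path"])
```

Keeping the path as a plain string mirrors how the other configuration values are written in the examples of this commit; whether the library expects a directory or a file name here is not stated in the hunk.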

docs/source/scrapers/graphs.rst

Lines changed: 39 additions & 2 deletions
@@ -6,11 +6,15 @@ Graphs are scraping pipelines aimed at solving specific tasks. They are composed
 There are several types of graphs available in the library, each with its own purpose and functionality. The most common ones are:
 
 - **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information using LLM.
-- **SmartScraperMultiGraph**: multi-page scraper that requires a user-defined prompt and a list of URLs (or local files) to extract information using LLM. It is built on top of SmartScraperGraph.
 - **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using LLM. It is built on top of SmartScraperGraph.
 - **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
 - **ScriptCreatorGraph**: script generator that creates a Python script to scrape a website using the specified library (e.g. BeautifulSoup). It requires a user-defined prompt and a URL (or local file).
 
+There are also two additional graphs that can handle multiple sources:
+
+- **SmartScraperMultiGraph**: similar to `SmartScraperGraph`, but with the ability to handle multiple sources.
+- **ScriptCreatorMultiGraph**: similar to `ScriptCreatorGraph`, but with the ability to handle multiple sources.
+
 With the introduction of `GPT-4o`, two new powerful graphs have been created:
 
 - **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
@@ -186,4 +190,37 @@ It will fetch the data from the source, extract the information based on the pro
    )
 
    result = speech_graph.run()
-   print(result)
+   print(result)
+
+
+ScriptCreatorGraph & ScriptCreatorMultiGraph
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. image:: ../../assets/scriptcreatorgraph.png
+   :align: center
+   :width: 90%
+   :alt: ScriptCreatorGraph
+
+First we define the graph configuration, which includes the LLM model and other parameters.
+Then we create an instance of the ScriptCreatorGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
+
+.. code-block:: python
+
+   from scrapegraphai.graphs import ScriptCreatorGraph
+
+   graph_config = {
+       "llm": {...},
+       "library": "beautifulsoup4"
+   }
+
+   script_creator_graph = ScriptCreatorGraph(
+       prompt="Create a Python script to scrape the projects.",
+       source="https://perinim.github.io/projects/",
+       config=graph_config,
+       schema=schema  # optional; assumes `schema` is defined beforehand
+   )
+
+   result = script_creator_graph.run()
+   print(result)
+
+**ScriptCreatorMultiGraph** is similar to ScriptCreatorGraph, but it can handle multiple sources. We define the graph configuration, create an instance of the ScriptCreatorMultiGraph class, and run the graph.
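
The documentation hunk describes the `ScriptCreatorMultiGraph` workflow in prose only. A minimal sketch of those three steps could look like the following. It relies on the `prompt`, `source` and `config` keyword arguments used by the example scripts added in this commit; the helper names `build_multi_graph_kwargs` and `run_script_creator_multi` and the placeholder model values are hypothetical and not part of the library.

```python
from typing import List


def build_multi_graph_kwargs(prompt: str, sources: List[str], config: dict) -> dict:
    """Collect constructor arguments for a multi-source graph (hypothetical helper)."""
    if not sources:
        raise ValueError("at least one source is required")
    # The multi graph takes a list of sources where the single graph takes one.
    return {"prompt": prompt, "source": list(sources), "config": config}


def run_script_creator_multi(kwargs: dict):
    """Create and run a ScriptCreatorMultiGraph; needs scrapegraphai installed."""
    # Imported lazily so the helper above can be used without the library.
    from scrapegraphai.graphs import ScriptCreatorMultiGraph

    graph = ScriptCreatorMultiGraph(**kwargs)
    return graph.run()


graph_config = {
    # Placeholder values; replace with a real model name and API key.
    "llm": {"model": "example-model", "api_key": "YOUR_API_KEY"},
    "library": "beautifulsoup",
}

kwargs = build_multi_graph_kwargs(
    prompt="Create a Python script to scrape the projects.",
    sources=["https://perinim.github.io/projects/"],
    config=graph_config,
)
print(sorted(kwargs))

# Uncomment to run the graph; this needs network access and a valid API key.
# result = run_script_creator_multi(kwargs)
# print(result)
```

Separating argument construction from the library call keeps the network-dependent step optional, so the configuration can be checked before any request is made.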
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+"""
+Basic example of scraping pipeline using ScriptCreatorMultiGraph
+"""
+
+import os
+from dotenv import load_dotenv
+from scrapegraphai.graphs import ScriptCreatorMultiGraph
+from scrapegraphai.utils import prettify_exec_info
+
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "api_key": os.getenv("ANTHROPIC_API_KEY"),
+        "model": "claude-3-haiku-20240307",
+        "max_tokens": 4000
+    },
+    "library": "beautifulsoup"
+}
+
+# ************************************************
+# Define the list of URLs to scrape
+# ************************************************
+
+urls=[
+    "https://schultzbergagency.com/emil-raste-karlsen/",
+    "https://schultzbergagency.com/johanna-hedberg/",
+]
+
+# ************************************************
+# Create the ScriptCreatorMultiGraph instance and run it
+# ************************************************
+
+script_creator_graph = ScriptCreatorMultiGraph(
+    prompt="Find information about actors",
+    # also accepts a string with the already downloaded HTML code
+    source=urls,
+    config=graph_config
+)
+
+result = script_creator_graph.run()
+print(result)
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = script_creator_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))

examples/anthropic/smart_scraper_multi_haiku.py

Lines changed: 4 additions & 21 deletions
@@ -12,31 +12,14 @@
 # Define the configuration for the graph
 # ************************************************
 
-openai_key = os.getenv("OPENAI_APIKEY")
-
-"""
-Basic example of scraping pipeline using SmartScraper
-"""
-
-import os, json
-from dotenv import load_dotenv
-from scrapegraphai.graphs import SmartScraperMultiGraph
-
 load_dotenv()
 
-# ************************************************
-# Define the configuration for the graph
-# ************************************************
-
-openai_key = os.getenv("OPENAI_APIKEY")
-
 graph_config = {
     "llm": {
-        "api_key": openai_key,
-        "model": "gpt-4o",
-    },
-    "verbose": True,
-    "headless": False,
+        "api_key": os.getenv("ANTHROPIC_API_KEY"),
+        "model": "claude-3-haiku-20240307",
+        "max_tokens": 4000
+    },
 }
 
 # *******************************************************

examples/azure/script_generator_azure.py

Lines changed: 2 additions & 1 deletion
@@ -25,7 +25,8 @@
 )
 graph_config = {
     "llm": {"model_instance": llm_model_instance},
-    "embeddings": {"model_instance": embedder_model_instance}
+    "embeddings": {"model_instance": embedder_model_instance},
+    "library": "beautifulsoup"
 }
 
 # ************************************************
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+"""
+Basic example of scraping pipeline using ScriptCreatorMultiGraph
+"""
+
+import os
+from dotenv import load_dotenv
+from scrapegraphai.graphs import ScriptCreatorMultiGraph
+from scrapegraphai.utils import prettify_exec_info
+from langchain_openai import AzureChatOpenAI
+from langchain_openai import AzureOpenAIEmbeddings
+
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+llm_model_instance = AzureChatOpenAI(
+    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
+    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
+)
+
+embedder_model_instance = AzureOpenAIEmbeddings(
+    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
+    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
+)
+graph_config = {
+    "llm": {"model_instance": llm_model_instance},
+    "embeddings": {"model_instance": embedder_model_instance},
+    "library": "beautifulsoup"
+}
+
+
+# ************************************************
+# Define the list of URLs to scrape
+# ************************************************
+
+urls=[
+    "https://schultzbergagency.com/emil-raste-karlsen/",
+    "https://schultzbergagency.com/johanna-hedberg/",
+]
+
+# ************************************************
+# Create the ScriptCreatorMultiGraph instance and run it
+# ************************************************
+
+script_creator_graph = ScriptCreatorMultiGraph(
+    prompt="Find information about actors",
+    # also accepts a string with the already downloaded HTML code
+    source=urls,
+    config=graph_config
+)
+
+result = script_creator_graph.run()
+print(result)
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = script_creator_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
