Multiple Branches #370

Merged
merged 17 commits into from
Jun 11, 2024
4 changes: 4 additions & 0 deletions .gitignore
@@ -23,6 +23,7 @@ docs/source/_static/
venv/
.venv/
.vscode/
.conda/

# exclude pdf, mp3
*.pdf
@@ -38,3 +39,6 @@ lib/
*.html
.idea

# extras
cache/
run_smart_scraper.py
7 changes: 5 additions & 2 deletions README.md
@@ -43,11 +43,14 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
Also check out the Docusaurus site [here](https://scrapegraph-doc.onrender.com/).

## 💻 Usage
There are three main scraping pipelines that can be used to extract information from a website (or local file):
There are multiple standard scraping pipelines that can be used to extract information from a website (or local file):
- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
- `SmartScraperMultiGraph`: multiple page scraper given a single prompt
- `ScriptCreatorGraph`: single-page scraper that extracts information from a website and generates a Python script.

- `SmartScraperMultiGraph`: multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources;
- `ScriptCreatorMultiGraph`: multi-page scraper that generates a Python script for extracting information from multiple pages given a single prompt and a list of sources.
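
As a rough sketch of the multi-page pipelines listed above, the snippet below builds a configuration and passes a list of sources to `SmartScraperMultiGraph`. The helper names (`build_config`, `scrape_many`, `main`), the environment variable, and the model settings are illustrative assumptions rather than part of this PR; the constructor arguments (`prompt`, `source`, `config`) mirror the example scripts added here.

```python
import os


def build_config(api_key, model="claude-3-haiku-20240307", max_tokens=4000):
    """Build a graph configuration dict (hypothetical helper, values are assumptions)."""
    return {
        "llm": {
            "api_key": api_key,
            "model": model,
            "max_tokens": max_tokens,
        },
    }


def scrape_many(prompt, sources, config):
    """Run a single prompt against several sources with SmartScraperMultiGraph."""
    if not isinstance(sources, (list, tuple)) or not sources:
        raise ValueError("sources must be a non-empty list of URLs or local files")
    # Imported lazily so build_config() can be used without scrapegraphai installed.
    from scrapegraphai.graphs import SmartScraperMultiGraph

    graph = SmartScraperMultiGraph(prompt=prompt, source=list(sources), config=config)
    return graph.run()


def main():
    """Example entry point; needs scrapegraphai installed and a valid API key."""
    config = build_config(os.getenv("ANTHROPIC_API_KEY"))
    urls = [
        "https://schultzbergagency.com/emil-raste-karlsen/",
        "https://schultzbergagency.com/johanna-hedberg/",
    ]
    print(scrape_many("Find information about actors", urls, config))
```

Calling `main()` performs real network and LLM requests, so it is left uncalled here; the single-page `SmartScraperGraph` takes one source string instead of a list.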

It is possible to use different LLMs through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.

Binary file added docs/assets/scriptcreatorgraph.png
1 change: 1 addition & 0 deletions docs/source/scrapers/graph_config.rst
@@ -13,6 +13,7 @@ Some interesting ones are:
- `loader_kwargs`: A dictionary with additional parameters to be passed to the `Loader` class, such as `proxy`.
- `burr_kwargs`: A dictionary with additional parameters to enable `Burr` graphical user interface.
- `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
- `cache_path`: The path where the cache files will be saved. If the path already exists, the cache will be loaded from it.
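
A minimal sketch of how `cache_path` might be added to an existing configuration. The helper `with_cache` is hypothetical, and the `cache` directory name is an assumption (chosen only because this PR adds `cache/` to `.gitignore`); the LLM settings are placeholders.

```python
from pathlib import Path


def with_cache(config, cache_dir="cache"):
    """Return a copy of `config` with `cache_path` set (hypothetical helper)."""
    updated = dict(config)  # shallow copy, the caller's dict is left untouched
    updated["cache_path"] = str(Path(cache_dir))
    return updated


base_config = {
    "llm": {"model": "claude-3-haiku-20240307"},  # placeholder LLM settings
    "verbose": True,
}

# The resulting dict can be passed as `config=` when creating a graph.
cached_config = with_cache(base_config)
print(cached_config["cache_path"])
```

Reusing the same path across runs is what lets a later run load the cache instead of rebuilding it.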

.. _Burr:

41 changes: 39 additions & 2 deletions docs/source/scrapers/graphs.rst
@@ -6,11 +6,15 @@ Graphs are scraping pipelines aimed at solving specific tasks. They are composed
There are several types of graphs available in the library, each with its own purpose and functionality. The most common ones are:

- **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information using LLM.
- **SmartScraperMultiGraph**: multi-page scraper that requires a user-defined prompt and a list of URLs (or local files) to extract information using LLM. It is built on top of SmartScraperGraph.
- **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using LLM. It is built on top of SmartScraperGraph.
- **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
- **ScriptCreatorGraph**: script generator that creates a Python script to scrape a website using the specified library (e.g. BeautifulSoup). It requires a user-defined prompt and a URL (or local file).

There are also two additional graphs that can handle multiple sources:

- **SmartScraperMultiGraph**: similar to `SmartScraperGraph`, but with the ability to handle multiple sources.
- **ScriptCreatorMultiGraph**: similar to `ScriptCreatorGraph`, but with the ability to handle multiple sources.

With the introduction of `GPT-4o`, two new powerful graphs have been created:

- **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
@@ -186,4 +190,37 @@ It will fetch the data from the source, extract the information based on the pro
)

result = speech_graph.run()
print(result)
print(result)


ScriptCreatorGraph & ScriptCreatorMultiGraph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../../assets/scriptcreatorgraph.png
   :align: center
   :width: 90%
   :alt: ScriptCreatorGraph

First we define the graph configuration, which includes the LLM model and other parameters.
Then we create an instance of the ScriptCreatorGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.

.. code-block:: python

   from scrapegraphai.graphs import ScriptCreatorGraph

   graph_config = {
      "llm": {...},
      "library": "beautifulsoup4"
   }

   script_creator_graph = ScriptCreatorGraph(
      prompt="Create a Python script to scrape the projects.",
      source="https://perinim.github.io/projects/",
      config=graph_config,
      schema=schema
   )

   result = script_creator_graph.run()
   print(result)

**ScriptCreatorMultiGraph** is similar to ScriptCreatorGraph, but it can handle multiple sources. We define the graph configuration, create an instance of the ScriptCreatorMultiGraph class, and run the graph.
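
The three steps described for `ScriptCreatorMultiGraph` could look roughly like the sketch below. The helpers `build_script_config`, `create_script_for`, and `main` are hypothetical, as are the environment variable and model settings; the `library` key and the constructor arguments follow the example scripts in this PR. The check for `library` is an assumption based on every script-creator example here setting it.

```python
import os


def build_script_config(api_key, library="beautifulsoup"):
    """Configuration for script generation (hypothetical helper, placeholder values)."""
    return {
        "llm": {
            "api_key": api_key,
            "model": "claude-3-haiku-20240307",
            "max_tokens": 4000,
        },
        # Scraping library the generated script should use.
        "library": library,
    }


def create_script_for(prompt, urls, config):
    """Generate one scraping script covering several URLs."""
    if "library" not in config:
        raise ValueError('config must name a scraping library under "library"')
    # Imported lazily so the config helper works without scrapegraphai installed.
    from scrapegraphai.graphs import ScriptCreatorMultiGraph

    graph = ScriptCreatorMultiGraph(prompt=prompt, source=urls, config=config)
    return graph.run()


def main():
    """Example entry point; needs scrapegraphai installed and a valid API key."""
    config = build_script_config(os.getenv("ANTHROPIC_API_KEY"))
    urls = [
        "https://schultzbergagency.com/emil-raste-karlsen/",
        "https://schultzbergagency.com/johanna-hedberg/",
    ]
    print(create_script_for("Find information about actors", urls, config))
```

As with the single-source graph, `run()` returns the result to print; `main()` is not called here because it issues real requests.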
53 changes: 53 additions & 0 deletions examples/anthropic/script_multi_generator_haiku.py
@@ -0,0 +1,53 @@
"""
Basic example of scraping pipeline using ScriptCreatorMultiGraph
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import ScriptCreatorMultiGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": os.getenv("ANTHROPIC_API_KEY"),
        "model": "claude-3-haiku-20240307",
        "max_tokens": 4000
    },
    "library": "beautifulsoup"
}

# ************************************************
# Define the list of URLs to process
# ************************************************

urls = [
    "https://schultzbergagency.com/emil-raste-karlsen/",
    "https://schultzbergagency.com/johanna-hedberg/",
]

# ************************************************
# Create the ScriptCreatorMultiGraph instance and run it
# ************************************************

script_creator_graph = ScriptCreatorMultiGraph(
    prompt="Find information about actors",
    # also accepts a string with the already downloaded HTML code
    source=urls,
    config=graph_config
)

result = script_creator_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = script_creator_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
25 changes: 4 additions & 21 deletions examples/anthropic/smart_scraper_multi_haiku.py
@@ -12,31 +12,14 @@
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

"""
Basic example of scraping pipeline using SmartScraper
"""

import os, json
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperMultiGraph

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
"llm": {
"api_key": openai_key,
"model": "gpt-4o",
},
"verbose": True,
"headless": False,
"api_key": os.getenv("ANTHROPIC_API_KEY"),
"model": "claude-3-haiku-20240307",
"max_tokens": 4000
},
}

# *******************************************************
3 changes: 2 additions & 1 deletion examples/azure/script_generator_azure.py
@@ -25,7 +25,8 @@
)
graph_config = {
"llm": {"model_instance": llm_model_instance},
"embeddings": {"model_instance": embedder_model_instance}
"embeddings": {"model_instance": embedder_model_instance},
"library": "beautifulsoup"
}

# ************************************************
61 changes: 61 additions & 0 deletions examples/azure/script_multi_generator_azure.py
@@ -0,0 +1,61 @@
"""
Basic example of scraping pipeline using ScriptCreatorMultiGraph
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import ScriptCreatorMultiGraph
from scrapegraphai.utils import prettify_exec_info
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************
llm_model_instance = AzureChatOpenAI(
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
)

embedder_model_instance = AzureOpenAIEmbeddings(
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
graph_config = {
    "llm": {"model_instance": llm_model_instance},
    "embeddings": {"model_instance": embedder_model_instance},
    "library": "beautifulsoup"
}


# ************************************************
# Define the list of URLs to process
# ************************************************

urls = [
    "https://schultzbergagency.com/emil-raste-karlsen/",
    "https://schultzbergagency.com/johanna-hedberg/",
]

# ************************************************
# Create the ScriptCreatorMultiGraph instance and run it
# ************************************************

script_creator_graph = ScriptCreatorMultiGraph(
    prompt="Find information about actors",
    # also accepts a string with the already downloaded HTML code
    source=urls,
    config=graph_config
)

result = script_creator_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = script_creator_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
52 changes: 52 additions & 0 deletions examples/bedrock/script_multi_generator_bedrock.py
@@ -0,0 +1,52 @@
"""
Basic example of scraping pipeline using ScriptCreatorMultiGraph
"""

from scrapegraphai.graphs import ScriptCreatorMultiGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "client": "client_name",
        "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        "temperature": 0.0
    },
    "embeddings": {
        "model": "bedrock/cohere.embed-multilingual-v3"
    },
    "library": "beautifulsoup"
}

# ************************************************
# Define the list of URLs to process
# ************************************************

urls = [
    "https://schultzbergagency.com/emil-raste-karlsen/",
    "https://schultzbergagency.com/johanna-hedberg/",
]

# ************************************************
# Create the ScriptCreatorMultiGraph instance and run it
# ************************************************

script_creator_graph = ScriptCreatorMultiGraph(
    prompt="Find information about actors",
    # also accepts a string with the already downloaded HTML code
    source=urls,
    config=graph_config
)

result = script_creator_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = script_creator_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
60 changes: 60 additions & 0 deletions examples/deepseek/script_multi_generator_deepseek.py
@@ -0,0 +1,60 @@
"""
Basic example of scraping pipeline using ScriptCreatorMultiGraph
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import ScriptCreatorMultiGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

deepseek_key = os.getenv("DEEPSEEK_APIKEY")

graph_config = {
    "llm": {
        "model": "deepseek-chat",
        "openai_api_key": deepseek_key,
        "openai_api_base": 'https://api.deepseek.com/v1',
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "library": "beautifulsoup"
}

# ************************************************
# Define the list of URLs to process
# ************************************************

urls = [
    "https://schultzbergagency.com/emil-raste-karlsen/",
    "https://schultzbergagency.com/johanna-hedberg/",
]

# ************************************************
# Create the ScriptCreatorMultiGraph instance and run it
# ************************************************

script_creator_graph = ScriptCreatorMultiGraph(
    prompt="Find information about actors",
    # also accepts a string with the already downloaded HTML code
    source=urls,
    config=graph_config
)

result = script_creator_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = script_creator_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))