reallignment #249

Merged
merged 182 commits into from
May 15, 2024
182 commits
b137456
Merge pull request #149 from VinciGit00/pre/beta
VinciGit00 May 4, 2024
2878695
fix: trailing whitespace
VinciGit00 May 4, 2024
24cbb7b
ci(release): 0.9.0 [skip ci]
semantic-release-bot May 4, 2024
cb1bd00
removed unused node
VinciGit00 May 5, 2024
7cc1eda
Merge branch 'main' of https://github.com/VinciGit00/Scrapegraph-ai
VinciGit00 May 5, 2024
89a1f99
add lava integration for ollama
VinciGit00 May 6, 2024
019b722
feat: add llava integration
VinciGit00 May 6, 2024
726de28
feat: Fix bug for gemini case when embeddings config not passed
shkamboj1 May 6, 2024
77505aa
Merge pull request #3 from shkamboj1/pre/beta
shkamboj1 May 6, 2024
b0573a2
Merge pull request #158 from shorthills-ai/pre/beta
VinciGit00 May 6, 2024
8c0b46e
ci(release): 0.9.0-beta.6 [skip ci]
semantic-release-bot May 6, 2024
fd01b73
fix(llm): fixed gemini api_key
PeriniM May 6, 2024
b053953
Merge pull request #159 from VinciGit00/fix-gemini-apikey
PeriniM May 6, 2024
6911e21
ci(release): 0.9.0-beta.7 [skip ci]
semantic-release-bot May 6, 2024
ac0a2e5
Update models_tokens.py
VinciGit00 May 6, 2024
8c7c3e3
Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-…
VinciGit00 May 6, 2024
e264e92
Added support for Claude 3 models from Anthropic
cemkod May 6, 2024
2ac9e16
Fixed accidental reformatting.
cemkod May 6, 2024
d5547a4
feat: add new hugging_face models
f-aguzzi May 6, 2024
f6442cc
Merge pull request #157 from VinciGit00/llava_integration
VinciGit00 May 6, 2024
739aaa3
ci(release): 0.9.0-beta.8 [skip ci]
semantic-release-bot May 6, 2024
97c3fff
Merge pull request #162 from f-aguzzi/patch-1
VinciGit00 May 6, 2024
cbd77df
removed claude
VinciGit00 May 6, 2024
ac6d200
Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-…
VinciGit00 May 6, 2024
5a67bca
Merge branch 'pre/beta' into pr/161
VinciGit00 May 6, 2024
88f04bf
Merge pull request #161 from cemkod/main
VinciGit00 May 6, 2024
c47a505
ci(release): 0.10.0-beta.1 [skip ci]
semantic-release-bot May 6, 2024
2258fe5
add new search graph examples
VinciGit00 May 6, 2024
17ec992
docs: update README.md
eltociear May 7, 2024
cbc7b1f
Merge pull request #165 from eltociear/patch-1
VinciGit00 May 7, 2024
96f9d63
Update README.md
May 7, 2024
9a873ca
Merge pull request #167 from KPCOFGS/main
VinciGit00 May 7, 2024
8632c0a
Merge pull request #169 from VinciGit00/main
VinciGit00 May 7, 2024
b1df161
Update examples.rst
kahwoo May 8, 2024
d86437f
Merge pull request #176 from kahwoo/patch-1
VinciGit00 May 8, 2024
cc28d5a
docs: fixed unused param and install
PeriniM May 8, 2024
186c0d0
fix(examples): openai std examples
PeriniM May 8, 2024
6b71ec1
fix(examples): local, mixed models and fixed SearchGraph embeddings p…
PeriniM May 8, 2024
71fcdfa
Merge pull request #177 from VinciGit00/fix-bugs
PeriniM May 8, 2024
d4c7d4e
fix: removed .lock file for deployment
PeriniM May 8, 2024
3f0e069
ci(release): 0.10.0-beta.2 [skip ci]
semantic-release-bot May 8, 2024
5ea4df4
Merge pull request #170 from VinciGit00/pre/beta
PeriniM May 8, 2024
0ca52b1
ci(release): 0.10.0 [skip ci]
semantic-release-bot May 8, 2024
ae5655f
docs(readme): improve main readme
PeriniM May 8, 2024
4bf90f3
docs: fixed speechgraphexample
PeriniM May 8, 2024
e7d39a5
fixed gemini embeddings
VinciGit00 May 8, 2024
4ed0fb8
feat: update info
VinciGit00 May 8, 2024
8272d73
add tokenizatio for mxbai-embed-large
VinciGit00 May 8, 2024
bd8afaf
Fixed "NameError: name 'GoogleGenerativeAIEmbeddings' is not defined"
arjuuuuunnnnn May 9, 2024
13238f4
Merge pull request #185 from arjuuuuunnnnn/main
VinciGit00 May 9, 2024
0bb68d1
Merge branch 'main' of https://github.com/VinciGit00/Scrapegraph-ai
VinciGit00 May 9, 2024
5449ebf
Merge branch 'main' of https://github.com/VinciGit00/Scrapegraph-ai
VinciGit00 May 9, 2024
7b07fdf
add groq example
VinciGit00 May 9, 2024
4039793
Update abstract_graph.py
VinciGit00 May 9, 2024
9415675
Update abstract_graph.py
VinciGit00 May 9, 2024
7e00c14
Merge pull request #183 from VinciGit00/182-googlegenerativeaiembeddi…
PeriniM May 9, 2024
ad32298
ci(release): 0.10.0-beta.3 [skip ci]
semantic-release-bot May 9, 2024
a37fbbc
fix: limit python version to < 3.12
daniele-roncaglioni May 9, 2024
590aab7
Merge pull request #193 from daniele-roncaglioni/189-poetry-python-ve…
VinciGit00 May 9, 2024
f10f3b1
feat: Add support for passing pdf path as source
shkamboj1 May 9, 2024
905b345
Merge pull request #4 from shkamboj1/pre/beta
shkamboj1 May 9, 2024
a1d580c
Merge pull request #195 from shorthills-ai/pre/beta
VinciGit00 May 9, 2024
84e8d12
update lock
VinciGit00 May 9, 2024
548bff9
ci(release): 0.10.0-beta.4 [skip ci]
semantic-release-bot May 9, 2024
324e977
fix: fixed bugs for csv and xml
VinciGit00 May 9, 2024
c32caad
Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-…
VinciGit00 May 9, 2024
28c9dce
ci(release): 0.10.0-beta.5 [skip ci]
semantic-release-bot May 9, 2024
0ab31c3
fix: add json integration
VinciGit00 May 9, 2024
460d292
ci(release): 0.10.0-beta.6 [skip ci]
semantic-release-bot May 9, 2024
f8ce3d5
fix: Augment the information getting fetched from a webpage
mayurdb May 10, 2024
772e064
docs: Update README.md
lurenss May 10, 2024
82318b9
add sponsor
VinciGit00 May 10, 2024
7ee5078
Merge branch 'main' of https://github.com/VinciGit00/Scrapegraph-ai
VinciGit00 May 10, 2024
67b8a14
Update README.md
VinciGit00 May 10, 2024
f8d8d71
docs: updated sponsor logo
PeriniM May 10, 2024
99adc97
Merge branch 'pre/beta' into fetchNodeFix
VinciGit00 May 10, 2024
4e62689
Merge pull request #203 from mayurdb/fetchNodeFix
VinciGit00 May 10, 2024
63c0dd9
ci(release): 0.11.0-beta.1 [skip ci]
semantic-release-bot May 10, 2024
702e913
Update README.md
VinciGit00 May 10, 2024
198420c
docs: update instructions to use with LocalAI
mudler May 10, 2024
a433399
Merge pull request #205 from mudler/patch-1
VinciGit00 May 10, 2024
86be41e
Revert "docs: update instructions to use with LocalAI"
PeriniM May 10, 2024
b5c1a7b
Merge pull request #206 from VinciGit00/revert-205-patch-1
PeriniM May 10, 2024
23b1e5f
Merge branch 'main' into docs
PeriniM May 10, 2024
b8a8ebb
Merge pull request #207 from VinciGit00/docs
PeriniM May 10, 2024
864aa91
feat: revert fetch_node
PeriniM May 10, 2024
7ae50c0
ci(release): 0.11.0-beta.2 [skip ci]
semantic-release-bot May 10, 2024
2f4fd45
fix(pytest): add dependency for mocking testing functions
DiTo97 May 10, 2024
db2234b
feat(webdriver-backend): add dynamic import scripts from module and file
DiTo97 May 10, 2024
2170131
feat(proxy-rotation): add parse (IP address) or search (from broker) …
DiTo97 May 10, 2024
768719c
feat(safe-web-driver): enchanced the original `AsyncChromiumLoader` w…
DiTo97 May 10, 2024
fc2aa3a
Merge branch 'pre/beta' of https://github.com/DiTo97/Scrapegraph-ai i…
DiTo97 May 10, 2024
67d8fec
Minor typo fix for clarity
epage480 May 10, 2024
627cbee
feat(parallel-exeuction): add asyncio event loop dispatcher with sema…
DiTo97 May 10, 2024
4088474
Added parse_html option in parse_node
epage480 May 10, 2024
aac51ba
Removed dead code, allows GenerateScraperNode to generate scraper with
epage480 May 10, 2024
24c3b05
Removed nonfunctional RAG node from ScriptCreatorGraph
epage480 May 10, 2024
0683e78
Merge branch 'pre/beta' into fix-GenerateScraperGraph
epage480 May 10, 2024
300fd5d
Fetch links in the page while parsing html
mayurdb May 11, 2024
1fa77e5
Merge pull request #215 from epage480/fix-GenerateScraperGraph
VinciGit00 May 11, 2024
b752499
Merge pull request #217 from mayurdb/fetchLinkFix
VinciGit00 May 11, 2024
04a4d84
Update serp_api_logo.png
VinciGit00 May 11, 2024
78f2174
Merge branch 'main' of https://github.com/VinciGit00/Scrapegraph-ai
VinciGit00 May 11, 2024
2563773
fix: crash asyncio due dependency version
lurenss May 11, 2024
d359814
ci(release): 0.10.1 [skip ci]
semantic-release-bot May 11, 2024
dc91719
Update cleanup_html.py
VinciGit00 May 11, 2024
b54d984
fix(chromium-loader): ensure it subclasses langchain's base loader
DiTo97 May 11, 2024
76b0e39
update tests
VinciGit00 May 11, 2024
13ae918
docs: add diagram showing general structure/flow of the library
daniele-roncaglioni May 11, 2024
df271b6
Add search link node that can find out relevant links in the webpage
mayurdb May 11, 2024
8f1fbe7
minor changes
mayurdb May 11, 2024
ea3b545
Merge branch 'pre/beta' into deepScrape
mayurdb May 11, 2024
9a67a26
Update documentation
mayurdb May 11, 2024
dd29c16
Merge branch 'deepScrape' of github.com:mayurdb/Scrapegraph-ai into d…
mayurdb May 11, 2024
d8ed76b
Merge pull request #221 from mayurdb/deepScrape
VinciGit00 May 11, 2024
b441b30
docs: update overview diagram with more models
daniele-roncaglioni May 11, 2024
3b9ec9b
Merge pull request #220 from daniele-roncaglioni/102-library-overview…
VinciGit00 May 11, 2024
156b67b
feat: add support for deepseek-chat
f-aguzzi May 11, 2024
e004c7c
Merge pull request #223 from f-aguzzi/pre/beta
VinciGit00 May 12, 2024
106fb12
ci(release): 0.11.0-beta.3 [skip ci]
semantic-release-bot May 12, 2024
e2350ed
feat: add new prompt info
VinciGit00 May 12, 2024
f359d5c
Merge pull request #224 from VinciGit00/fixing-prompts
VinciGit00 May 12, 2024
4ccddda
ci(release): 0.11.0-beta.4 [skip ci]
semantic-release-bot May 12, 2024
1e9a564
fix(proxy-rotation): removed duplicated arg and passed the loader_kwa…
PeriniM May 12, 2024
30758b4
Create smart_scarper_deepseek.py
VinciGit00 May 12, 2024
5d6d996
fix(proxy-rotation): removed max_shape duplicate
PeriniM May 13, 2024
e256b75
docs(refactor): added proxy-rotation usage and refactor readthedocs
PeriniM May 13, 2024
0c36a7e
feat: added proxy rotation
PeriniM May 13, 2024
7e8acd8
Merge branch 'pre/beta' into fix/fetch-node-proxybroker
PeriniM May 13, 2024
b8079f8
Merge pull request #211 from DiTo97/fix/fetch-node-proxybroker
PeriniM May 13, 2024
fc56d6b
Update README.md
VinciGit00 May 13, 2024
353382b
ci(release): 0.11.0-beta.5 [skip ci]
semantic-release-bot May 13, 2024
0c15947
fix(fetch-node): removed isSoup from default
PeriniM May 13, 2024
2724d3d
ci(release): 0.11.0-beta.6 [skip ci]
semantic-release-bot May 13, 2024
c7ec114
docs(refactor): changed example
PeriniM May 13, 2024
60ed80f
Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-…
PeriniM May 13, 2024
7c91f9f
add examples for deepseek
VinciGit00 May 13, 2024
39be38f
Fixed anthropic/bedrock conflict; Removed duplicate class Claude; Upd…
JGalego May 13, 2024
d0167de
fix: bug for claude
VinciGit00 May 13, 2024
f0f7373
ci(release): 0.11.0-beta.7 [skip ci]
semantic-release-bot May 13, 2024
f3d44c0
Merge pull request #228 from JGalego/fix/bedrock-support
VinciGit00 May 13, 2024
dedc733
fix(asyncio): replaced deepcopy with copy due to serialization problems
PeriniM May 13, 2024
859c5d5
Refactored to include custom AWS client for bedrock; Added missing An…
JGalego May 13, 2024
28ab8da
Merge pull request #229 from JGalego/feat/custom-aws-creds
VinciGit00 May 13, 2024
c0d26d6
ad bedrocl
VinciGit00 May 13, 2024
d9752b1
chore: update models_tokens.py with new model configurations
arsaboo May 13, 2024
a8d5e7d
feat(batchsize): tested different batch sizes and systems
PeriniM May 13, 2024
367dea5
Merge branch 'pre/beta' into feat/parallel-node-execution
PeriniM May 13, 2024
62a74a5
Merge pull request #213 from DiTo97/feat/parallel-node-execution
PeriniM May 13, 2024
df918fa
Merge pull request #231 from arsaboo/models
PeriniM May 13, 2024
fa4edb4
ci(release): 0.11.0-beta.8 [skip ci]
semantic-release-bot May 13, 2024
ced2bbc
docs(concurrent): refactor theme and added benchmarck searchgraph
PeriniM May 14, 2024
4fd8a39
Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-…
PeriniM May 14, 2024
d6f5ca8
Merge branch 'main' into pre/beta
VinciGit00 May 14, 2024
5914fa8
Update poetry.lock
VinciGit00 May 14, 2024
d2877d8
ci(release): 0.11.0-beta.9 [skip ci]
semantic-release-bot May 14, 2024
52a4a3b
feat: add gpt-4o
f-aguzzi May 14, 2024
8e46799
Merge pull request #235 from f-aguzzi/pre/beta
PeriniM May 14, 2024
218b8ed
ci(release): 0.11.0-beta.10 [skip ci]
semantic-release-bot May 14, 2024
90955ca
feat(gpt-4o): image to text single node test
PeriniM May 14, 2024
a296927
feat(omni-scraper): working OmniScraperGraph with images
PeriniM May 14, 2024
fcb3abb
feat(omni-search): added omni search graph and updated docs
PeriniM May 14, 2024
a6e1813
fix(fetch_node): bug in handling local files
PeriniM May 14, 2024
a458ec4
Update the prompt for the search_link_node
mayurdb May 14, 2024
d76badd
Merge pull request #239 from mayurdb/deepScrapeFix
VinciGit00 May 14, 2024
932df8d
Merge pull request #238 from VinciGit00/gpt4-omni
VinciGit00 May 14, 2024
8727d03
ci(release): 0.11.0-beta.11 [skip ci]
semantic-release-bot May 14, 2024
2a57940
Merge pull request #234 from VinciGit00/pre/beta
VinciGit00 May 14, 2024
c55a3b1
ci(release): 0.11.0 [skip ci]
semantic-release-bot May 14, 2024
b0a67ba
fix(docs): requirements-dev
PeriniM May 14, 2024
6effe25
ci(release): 0.11.1 [skip ci]
semantic-release-bot May 14, 2024
78d1940
docs(main-readme): fixed some typos
PeriniM May 15, 2024
8fc2510
chore(package manager)!: move from poetry to rye
f-aguzzi May 15, 2024
672bd29
Merge pull request #244 from f-aguzzi/main
VinciGit00 May 15, 2024
c0b6f02
ci(release): 1.0.0 [skip ci]
semantic-release-bot May 15, 2024
24d56af
Update pyproject.toml
VinciGit00 May 15, 2024
096b665
fix(searchgraph): used shallow copy to serialize obj
PeriniM May 15, 2024
a81d2b7
ci(release): 1.0.1 [skip ci]
semantic-release-bot May 15, 2024
7ccd51a
add rye update script bash
VinciGit00 May 15, 2024
694d3ab
Merge branch 'main' of https://github.com/VinciGit00/Scrapegraph-ai
VinciGit00 May 15, 2024
efb781f
docs(rye): replaced poetry with rye
PeriniM May 15, 2024
22cd9e3
Merge branch 'search_link_context' into main
VinciGit00 May 15, 2024
11 changes: 4 additions & 7 deletions .github/workflows/release.yml
@@ -14,11 +14,8 @@ jobs:
run: |
sudo apt update
sudo apt install -y git
- name: Install Python Env and Poetry
uses: actions/setup-python@v5
with:
python-version: '3.9'
- run: pip install poetry
- name: Install the latest version of rye
uses: eifinger/setup-rye@v3
- name: Install Node Env
uses: actions/setup-node@v4
with:
@@ -30,8 +27,8 @@ jobs:
persist-credentials: false
- name: Build app
run: |
poetry install
poetry build
rye sync --no-lock
rye build
id: build_cache
if: success()
- name: Cache build
6 changes: 2 additions & 4 deletions .gitignore
@@ -31,8 +31,6 @@ examples/graph_examples/ScrapeGraphAI_generated_graph
examples/**/result.csv
examples/**/result.json
main.py
poetry.lock

# lock files
*.python-version
*.lock
poetry.lock
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.9.19
354 changes: 354 additions & 0 deletions CHANGELOG.md

Large diffs are not rendered by default.

188 changes: 61 additions & 127 deletions README.md
@@ -8,14 +8,14 @@
[![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)


ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites, documents and XML files.
ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).

Just say which information you want to extract and the library will do it for you!

<p align="center">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
</p>


## 🚀 Quick install

The reference page for Scrapegraph-ai is available on the official page of pypy: [pypi](https://pypi.org/project/scrapegraphai/).
@@ -39,20 +39,23 @@ Try it directly on the web using Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)

Follow the procedure on the following link to setup your OpenAI API key: [link](https://scrapegraph-ai.readthedocs.io/en/latest/index.html).

## 📖 Documentation

The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).

Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).

## 💻 Usage
You can use the `SmartScraper` class to extract information from a website using a prompt.
There are three main scraping pipelines that can be used to extract information from a website (or local file):
- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.

It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
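
As a rough illustration of how the provider choice shows up in the configuration, the sketch below assembles a `graph_config` dictionary shaped like the ones in the README examples. The helper `build_graph_config` and the constant `OLLAMA_URL` are hypothetical names introduced here for illustration; they are not part of the library. Only the dictionary keys (`llm`, `embeddings`, `model`, `api_key`, `temperature`, `format`, `base_url`, `verbose`, `max_results`) are taken from the examples in this diff.

```python
from typing import Any, Optional

# Default local Ollama endpoint, as used in the README examples.
OLLAMA_URL = "http://localhost:11434"


def build_graph_config(
    llm_model: str,
    api_key: Optional[str] = None,
    embeddings_model: Optional[str] = None,
    **options: Any,
) -> dict:
    """Hypothetical helper: build a graph_config dict like the README examples."""
    llm: dict = {"model": llm_model, "temperature": 0}
    if llm_model.startswith("ollama/"):
        # The README notes that Ollama needs the output format set explicitly.
        llm["format"] = "json"
        llm["base_url"] = OLLAMA_URL
    elif api_key is not None:
        # Hosted providers (OpenAI, Groq, ...) take an API key instead.
        llm["api_key"] = api_key

    config: dict = {"llm": llm}
    if embeddings_model is not None:
        embeddings: dict = {"model": embeddings_model}
        if embeddings_model.startswith("ollama/"):
            embeddings["base_url"] = OLLAMA_URL
        config["embeddings"] = embeddings

    # Graph-level options such as verbose or max_results go at the top level.
    config.update(options)
    return config


# Fully local setup: Ollama for both the LLM and the embeddings.
local_config = build_graph_config(
    "ollama/mistral",
    embeddings_model="ollama/nomic-embed-text",
    verbose=True,
)

# Mixed setup: Groq for the LLM, Ollama for the embeddings.
mixed_config = build_graph_config(
    "groq/gemma-7b-it",
    api_key="GROQ_API_KEY",
    embeddings_model="ollama/nomic-embed-text",
    max_results=5,
)
```

Either dictionary could then be passed as the `config` argument of a graph class, in the same way as the hand-written dictionaries in the cases that follow.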

### Case 1: SmartScraper using Local Models

The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
### Case 1: Extracting information using Ollama
Remember to download the model on Ollama separately!
Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.

```python
from scrapegraphai.graphs import SmartScraperGraph
@@ -67,11 +70,12 @@ graph_config = {
"embeddings": {
"model": "ollama/nomic-embed-text",
"base_url": "http://localhost:11434", # set Ollama URL
}
},
"verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles",
prompt="List me all the projects with their descriptions",
# also accepts a string with the already downloaded HTML code
source="https://perinim.github.io/projects",
config=graph_config
@@ -82,160 +86,86 @@ print(result)

```

### Case 2: Extracting information using Docker
The output will be a list of projects with their descriptions like the following:

Note: before using the local model remember to create the docker container!
```text
docker-compose up -d
docker exec -it ollama ollama pull stablelm-zephyr
```
You can use which models avaiable on Ollama or your own model instead of stablelm-zephyr
```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
"llm": {
"model": "ollama/mistral",
"temperature": 0,
"format": "json", # Ollama needs the format to be specified explicitly
# "model_tokens": 2000, # set context length arbitrarily
},
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles",
# also accepts a string with the already downloaded HTML code
source="https://perinim.github.io/projects",
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
```

### Case 2: SearchGraph using Mixed Models

### Case 3: Extracting information using Openai model
```python
from scrapegraphai.graphs import SmartScraperGraph
OPENAI_API_KEY = "YOUR_API_KEY"

graph_config = {
"llm": {
"api_key": OPENAI_API_KEY,
"model": "gpt-3.5-turbo",
},
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles",
# also accepts a string with the already downloaded HTML code
source="https://perinim.github.io/projects",
config=graph_config
)
We use **Groq** for the LLM and **Ollama** for the embeddings.

result = smart_scraper_graph.run()
print(result)
```

### Case 4: Extracting information using Groq
```python
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

groq_key = os.getenv("GROQ_APIKEY")
from scrapegraphai.graphs import SearchGraph

# Define the configuration for the graph
graph_config = {
"llm": {
"model": "groq/gemma-7b-it",
"api_key": groq_key,
"api_key": "GROQ_API_KEY",
"temperature": 0
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434",
"base_url": "http://localhost:11434", # set ollama URL arbitrarily
},
"headless": False
"max_results": 5,
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the projects with their description and the author.",
source="https://perinim.github.io/projects",
# Create the SearchGraph instance
search_graph = SearchGraph(
prompt="List me all the traditional recipes from Chioggia",
config=graph_config
)

result = smart_scraper_graph.run()
# Run the graph
result = search_graph.run()
print(result)
```

The output will be a list of recipes like the following:

### Case 5: Extracting information using Azure
```python
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings

lm_model_instance = AzureChatOpenAI(
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
)

embedder_model_instance = AzureOpenAIEmbeddings(
azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
graph_config = {
"llm": {"model_instance": llm_model_instance},
"embeddings": {"model_instance": embedder_model_instance}
}

smart_scraper_graph = SmartScraperGraph(
prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
event_end_date, event_end_time, location, event_mode, event_category,
third_party_redirect, no_of_days,
time_in_hours, hosted_or_attending, refreshments_type,
registration_available, registration_link""",
source="https://www.hmhco.com/event",
config=graph_config
)
{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
```
### Case 3: SpeechGraph using OpenAI

You just need to pass the OpenAI API key and the model name.

### Case 6: Extracting information using Gemini
```python
from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"
from scrapegraphai.graphs import SpeechGraph

# Define the configuration for the graph
graph_config = {
"llm": {
"api_key": GOOGLE_APIKEY,
"model": "gemini-pro",
"api_key": "OPENAI_API_KEY",
"model": "gpt-3.5-turbo",
},
"tts_model": {
"api_key": "OPENAI_API_KEY",
"model": "tts-1",
"voice": "alloy"
},
"output_path": "audio_summary.mp3",
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles",
source="https://perinim.github.io/projects",
config=graph_config
# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************

speech_graph = SpeechGraph(
prompt="Make a detailed audio summary of the projects.",
source="https://perinim.github.io/projects/",
config=graph_config,
)

result = smart_scraper_graph.run()
result = speech_graph.run()
print(result)
```

The output for all 3 the cases will be a dictionary with the extracted information, for example:

```bash
{
'titles': [
'Rotary Pendulum RL'
],
'descriptions': [
'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
]
}
```

The output will be an audio file with the summary of the projects on the page.

## 🤝 Contributing

Feel free to contribute and join our Discord server to discuss with us improvements and give us suggestions!
@@ -247,12 +177,16 @@ Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegra
[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)

## 📈 Roadmap
Check out the project roadmap [here](docs/README.md)! 🚀
Check out the project roadmap [here](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/README.md)! 🚀

Wanna visualize the roadmap in a more interactive way? Check out the [markmap](https://markmap.js.org/repl) visualization by copy pasting the markdown content in the editor!

## ❤️ Contributors
[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
## Sponsors
<p align="center">
<a href="https://serpapi.com?utm_source=scrapegraphai"><img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;"></a>
</p>

## 🎓 Citations
If you have used our library for research purposes please quote us with the following reference:
@@ -269,7 +203,7 @@ If you have used our library for research purposes please quote us with the foll
## Authors

<p align="center">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors Logos">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors_logos">
</p>

| | Contact Info |
@@ -285,4 +219,4 @@ ScrapeGraphAI is licensed under the MIT License. See the [LICENSE](https://githu
## Acknowledgements

- We would like to thank all the contributors to the project and the open-source community for their support.
- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.
- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.
Binary file added docs/assets/omniscrapergraph.png
Binary file added docs/assets/omnisearchgraph.png
Binary file added docs/assets/project_overview_diagram.fig
Binary file not shown.
Binary file added docs/assets/project_overview_diagram.png
Binary file added docs/assets/searchgraph.png
Binary file added docs/assets/serp_api_logo.png
Binary file added docs/assets/smartscrapergraph.png
Binary file added docs/assets/speechgraph.png
28 changes: 22 additions & 6 deletions docs/source/conf.py
@@ -14,20 +14,36 @@
# import all the modules
sys.path.insert(0, os.path.abspath('../../'))

project = 'scrapegraphai'
copyright = '2024, Marco Vinciguerra'
author = 'Marco Vinciguerra'
project = 'ScrapeGraphAI'
copyright = '2024, ScrapeGraphAI'
author = 'Marco Vinciguerra, Marco Perini, Lorenzo Padoan'

html_last_updated_fmt = "%b %d, %Y"

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon','sphinx_wagtail_theme']

templates_path = ['_templates']
exclude_patterns = []

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
# html_theme = 'sphinx_rtd_theme'
html_theme = 'sphinx_wagtail_theme'

html_theme_options = dict(
project_name = "ScrapeGraphAI",
logo = "scrapegraphai_logo.png",
logo_alt = "ScrapeGraphAI",
logo_height = 59,
logo_url = "https://scrapegraph-ai.readthedocs.io/en/latest/",
logo_width = 45,
github_url = "https://github.com/VinciGit00/Scrapegraph-ai/tree/main/docs/source/",
footer_links = ",".join(
["Landing Page|https://scrapegraphai.com/",
"Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
),
)