Pre/beta #438


Merged: 51 commits, merged on Jul 9, 2024.
- `40b4b2d` add new models tokens (VinciGit00, Jun 19, 2024)
- `79a2f51` add new models tokens (VinciGit00, Jun 19, 2024)
- `8bb560a` add convert function (VinciGit00, Jun 19, 2024)
- `6d78375` add benchmark (VinciGit00, Jun 19, 2024)
- `23bc633` fixed a bug (VinciGit00, Jun 19, 2024)
- `5664eb2` Update generate_answer_node_prompts.py (VinciGit00, Jun 20, 2024)
- `2f02830` refactoring of fetch node (VinciGit00, Jun 20, 2024)
- `5d61238` add new convert function (VinciGit00, Jun 20, 2024)
- `7af411a` add trigger (VinciGit00, Jun 21, 2024)
- `d1c3de7` fixed a bug (VinciGit00, Jun 21, 2024)
- `cf9a3d1` add test (VinciGit00, Jun 21, 2024)
- `6549915` Update Readme.md (VinciGit00, Jun 21, 2024)
- `afd46ac` fixed generate_answer_node (VinciGit00, Jun 22, 2024)
- `d8fcb6c` add new examples (VinciGit00, Jun 22, 2024)
- `9917972` fixed request (VinciGit00, Jun 22, 2024)
- `92cabe1` add load examples from a yml file (VinciGit00, Jun 23, 2024)
- `3a537ee` fix: add test (VinciGit00, Jun 23, 2024)
- `df0e310` feat: add fireworks integration (VinciGit00, Jun 24, 2024)
- `4b56604` add examples + test (VinciGit00, Jun 25, 2024)
- `228a1de` add new force (VinciGit00, Jun 27, 2024)
- `9b45ebc` modify fetch node with no cut mode (VinciGit00, Jun 28, 2024)
- `2804434` feat: add integrations for markdown files (VinciGit00, Jun 29, 2024)
- `30ca15c` Merge branch 'md_scraper_integration' into integration_markdown (VinciGit00, Jun 30, 2024)
- `2242b10` Merge pull request #419 from VinciGit00/integration_markdown (VinciGit00, Jun 30, 2024)
- `5fe694b` feat: improve md prompt recognition (VinciGit00, Jun 30, 2024)
- `119514b` feat: add vertexai integration (VinciGit00, Jul 1, 2024)
- `f3b6343` add new info (VinciGit00, Jul 1, 2024)
- `e3a19c2` Merge pull request #428 from VinciGit00/integration_markdown (VinciGit00, Jul 1, 2024)
- `3bf5f57` feat: add integration for infos (VinciGit00, Jul 1, 2024)
- `ed2af51` update the chunk size (VinciGit00, Jul 2, 2024)
- `d419b0a` Update docker-compose.yml (VinciGit00, Jul 2, 2024)
- `3ee1743` update prompts (VinciGit00, Jul 4, 2024)
- `583c321` chore(CI): fix pylint workflow (f-aguzzi, Jul 4, 2024)
- `afeb81f` chore(Docker): fix port number (f-aguzzi, Jul 4, 2024)
- `2ab7ddc` Merge pull request #405 from ScrapeGraphAI/integration_markdown (f-aguzzi, Jul 4, 2024)
- `720f187` Merge branch 'fireworks_integration' into support (f-aguzzi, Jul 4, 2024)
- `27c2dd2` chore(rye): rebuild lockfiles (f-aguzzi, Jul 4, 2024)
- `d77a622` Merge branch '423-add-vertex-ai-integration' into support (f-aguzzi, Jul 4, 2024)
- `591cab0` add new env (VinciGit00, Jul 4, 2024)
- `8f4a13b` Merge pull request #437 from ScrapeGraphAI/main (VinciGit00, Jul 4, 2024)
- `1bbd25a` Merge pull request #407 from ScrapeGraphAI/404-split-unit-testing-fro… (f-aguzzi, Jul 4, 2024)
- `8f9f96f` ci(release): 1.8.1-beta.1 [skip ci] (semantic-release-bot, Jul 4, 2024)
- `104d869` Merge branch 'pre/beta' into support (f-aguzzi, Jul 4, 2024)
- `fd6142e` Merge pull request #436 from ScrapeGraphAI/support (VinciGit00, Jul 4, 2024)
- `8a52914` Merge pull request #429 from ScrapeGraphAI/421-default-prompt-templat… (VinciGit00, Jul 4, 2024)
- `765b548` Merge pull request #417 from ScrapeGraphAI/md_scraper_integration (VinciGit00, Jul 4, 2024)
- `146432d` ci(release): 1.9.0-beta.1 [skip ci] (semantic-release-bot, Jul 4, 2024)
- `ba782a6` add compatibility for versions (VinciGit00, Jul 4, 2024)
- `ff80cbb` Merge branch 'pre/beta' of https://github.com/ScrapeGraphAI/Scrapegra… (VinciGit00, Jul 4, 2024)
- `7570bf8` fix: fix pyproject.toml (VinciGit00, Jul 5, 2024)
- `5cb5fbf` ci(release): 1.9.0-beta.2 [skip ci] (semantic-release-bot, Jul 5, 2024)
26 changes: 11 additions & 15 deletions .github/workflows/pylint.yml
```diff
@@ -1,30 +1,26 @@
-on: [push]
+on:
+  push:
+    paths:
+      - 'scrapegraphai/**'
+      - '.github/workflows/pylint.yml'
 
 jobs:
   build:
     runs-on: ubuntu-latest
     strategy:
       matrix:
         python-version: ["3.10"]
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}
+      - name: Install the latest version of rye
+        uses: eifinger/setup-rye@v3
       - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install pylint
-          pip install -r requirements.txt
+        run: rye sync --no-lock
       - name: Analysing the code with pylint
-        run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py
+        run: rye run pylint-ci
       - name: Check Pylint score
         run: |
-          pylint_score=$(pylint --disable=all --enable=metrics --output-format=text scrapegraphai/**/*.py scrapegraphai/*.py | grep 'Raw metrics' | awk '{print $4}')
+          pylint_score=$(rye run pylint-score-ci | grep 'Raw metrics' | awk '{print $4}')
           if (( $(echo "$pylint_score < 8" | bc -l) )); then
             echo "Pylint score is below 8. Blocking commit."
             exit 1
           else
             echo "Pylint score is acceptable."
           fi
```
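The score gate in this workflow pipes pylint output through `grep` and `awk` and compares the result with `bc`. A minimal Python sketch of the same logic, handy for checking the pipeline locally; the sample "Raw metrics" line is a hypothetical stand-in, since the exact pylint output format is not shown in this diff:

```python
def extract_score(pylint_output: str) -> float:
    """Equivalent of `grep 'Raw metrics' | awk '{print $4}'`:
    find the matching line and take its fourth whitespace-separated field."""
    for line in pylint_output.splitlines():
        if "Raw metrics" in line:
            return float(line.split()[3])
    raise ValueError("no 'Raw metrics' line found")


def score_gate(score: float, threshold: float = 8.0) -> bool:
    """Equivalent of the bc comparison: True means the commit passes."""
    return score >= threshold


# Hypothetical output line; the real pylint format may differ.
sample = "Raw metrics score 9.12 statements"
print(score_gate(extract_score(sample)))  # True: 9.12 clears the 8.0 bar
```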
38 changes: 38 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,41 @@
## [1.9.0-beta.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.9.0-beta.1...v1.9.0-beta.2) (2024-07-05)


### Bug Fixes

* fix pyproject.toml ([7570bf8](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7570bf8294e49bc54ec9e296aaadb763873390ca))

## [1.9.0-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.8.1-beta.1...v1.9.0-beta.1) (2024-07-04)


### Features

* add fireworks integration ([df0e310](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/df0e3108299071b849d7e055bd11d72764d24f08))
* add integration for infos ([3bf5f57](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3bf5f570a8f8e1b037a7ad3c9f583261a1536421))
* add integrations for markdown files ([2804434](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/2804434a9ee12c52ae8956a88b1778a4dd3ec32f))
* add vertexai integration ([119514b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/119514bdfc2a16dfb8918b0c34ae7cc43a01384c))
* improve md prompt recognition ([5fe694b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5fe694b6b4545a5091d16110318b992acfca4f58))


### chore

* **Docker:** fix port number ([afeb81f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/afeb81f77a884799192d79dcac85666190fb1c9d))
* **CI:** fix pylint workflow ([583c321](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/583c32106e827f50235d8fc69511652fd4b07a35))
* **rye:** rebuild lockfiles ([27c2dd2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/27c2dd23517a7e4b14fafd00320a8b81f73145dc))

## [1.8.1-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.8.0...v1.8.1-beta.1) (2024-07-04)


### Bug Fixes

* add test ([3a537ee](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3a537eec6fef1743924a9aa5cef0ba2f8d44bf11))


### Docs

* **roadmap:** fix urls ([14faba4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/14faba4f00dd9f947f8dc5e0b51be49ea684179f))
* **roadmap:** next steps ([3e644f4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3e644f498f05eb505fbd4e94b144c81567569aaa))

## [1.8.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.7.5...v1.8.0) (2024-06-30)


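The feature entries above (fireworks, VertexAI, markdown) all surface through the library's `graph_config` dictionary, as the example scripts in this PR show. A hedged sketch of how a backend might be selected from a provider-prefixed model string; the helper, the default fallback, and the model identifiers are illustrative assumptions, not API taken from the PR:

```python
def provider_of(model: str) -> str:
    """Infer the backend from a provider-prefixed model string
    (the 'provider/model' convention and the default are assumptions)."""
    return model.split("/", 1)[0] if "/" in model else "openai"


# Illustrative configs; exact model identifiers are assumptions.
fireworks_config = {"llm": {"api_key": "<FIREWORKS_API_KEY>",
                            "model": "fireworks/mixtral-8x7b-instruct"}}
vertexai_config = {"llm": {"model": "vertexai/gemini-pro"}}

print(provider_of(fireworks_config["llm"]["model"]))  # fireworks
print(provider_of(vertexai_config["llm"]["model"]))   # vertexai
```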
2 changes: 1 addition & 1 deletion README.md
```diff
@@ -12,7 +12,7 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).
 
 Just say which information you want to extract and the library will do it for you!
```
2 changes: 1 addition & 1 deletion docker-compose.yml
```diff
@@ -4,7 +4,7 @@ services:
     image: ollama/ollama
     container_name: ollama
     ports:
-      - "5000:5000"
+      - "11434:11434"
     volumes:
       - ollama_volume:/root/.ollama
     restart: unless-stopped
```
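This port fix matters because Ollama's HTTP API listens on 11434 by default, not 5000. A small sketch for verifying that the remapped container is reachable; the endpoint URL is assumed to be Ollama's default:

```python
import urllib.request
from urllib.error import URLError

OLLAMA_URL = "http://localhost:11434"  # default Ollama port, matching the compose fix


def ollama_reachable(url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if the Ollama endpoint answers an HTTP request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```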
37 changes: 19 additions & 18 deletions examples/benchmarks/SmartScraper/Readme.md
```diff
@@ -1,16 +1,17 @@
-# Local models
+# Local models
 The two websites benchmark are:
 - Example 1: https://perinim.github.io/projects
 - Example 2: https://www.wired.com (at 17/4/2024)
 
 Both are strored locally as txt file in .txt format because in this way we do not have to think about the internet connection
 
-| Hardware           | Model                                   | Example 1 | Example 2 |
-| ------------------ | --------------------------------------- | --------- | --------- |
-| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s    | 26.61s    |
-| Macbook m2 max     | Mistral on Ollama with nomic-embed-text | 8.05s     | 12.17s    |
-| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text  | 29.87s    | 35.32s    |
-| Macbook m2 max     | Llama3 on Ollama with nomic-embed-text  | 18.36s    | 78.32s    |
+| Hardware               | Model                                   | Example 1 | Example 2 |
+| ---------------------- | --------------------------------------- | --------- | --------- |
+| Macbook 14' m1 pro     | Mistral on Ollama with nomic-embed-text | 16.291s   | 38.74s    |
+| Macbook m2 max         | Mistral on Ollama with nomic-embed-text |           |           |
+| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text  | 12.88s    | 13.84s    |
+| Macbook m2 max<br>     | Llama3 on Ollama with nomic-embed-text  |           |           |
 
 **Note**: the examples on Docker are not runned on other devices than the Macbook because the performance are to slow (10 times slower than Ollama). Indeed the results are the following:
 
@@ -22,20 +23,20 @@
 **URL**: https://perinim.github.io/projects
 **Task**: List me all the projects with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.22                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 9.53                     | 449          | 272           | 177               | 1                   | 0.00803        |
-| Grooq with nomic-embed-text | 1.99                     | 474          | 284           | 190               | 1                   | 0              |
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 4.132s                   | 438          | 303           | 135               | 1                   | 0.000724       |
+| gpt-4-turbo-preview             | 6.965s                   | 442          | 303           | 139               | 1                   | 0.0072         |
+| gpt-4-o                         | 4.446s                   | 444          | 305           | 139               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 1.335s                   | 648          | 482           | 166               | 1                   | 0              |
 
 ### Example 2: Wired
 **URL**: https://www.wired.com
 **Task**: List me all the articles with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.89                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 64.70                    | 3573         | 2199          | 1374              | 1                   | 0.06321        |
-| Grooq with nomic-embed-text | 3.82                     | 2459         | 2192          | 267               | 1                   | 0              |
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 8.836s                   | 1167         | 726           | 441               | 1                   | 0.001971       |
+| gpt-4-turbo-preview             | 21.53s                   | 1205         | 726           | 479               | 1                   | 0.02163        |
+| gpt-4-o                         | 15.27s                   | 1400         | 715           | 685               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 3.82s                    | 2459         | 2192          | 267               | 1                   | 0              |
```
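The execution times in the tables above come from wall-clock measurements around each graph run. A minimal timing wrapper sketch showing only the measurement pattern, not the library calls themselves:

```python
import time


def timed_run(run_fn, label: str):
    """Time a zero-argument callable and report it like the benchmark tables."""
    start = time.perf_counter()
    result = run_fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# With a real graph this would be, e.g.:
# timed_run(smart_scraper_graph.run, "gpt-4o / Example 1")
```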
53 changes: 53 additions & 0 deletions examples/benchmarks/SmartScraper/benchmark_openai_gpt4o.py
```python
"""
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
         "List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
    with open(files[i], 'r', encoding="utf-8") as file:
        text = file.read()

    smart_scraper_graph = SmartScraperGraph(
        prompt=tasks[i],
        source=text,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

    # ************************************************
    # Get graph execution info
    # ************************************************

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))
```
50 changes: 50 additions & 0 deletions examples/extras/custom_prompt.py
```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import os
import json
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

prompt = "Some more info"

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "additional_info": prompt,
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
15 changes: 15 additions & 0 deletions examples/extras/example.yml
```yaml
{
  "llm": {
    "model": "ollama/llama3",
    "temperature": 0,
    "format": "json",
    # "base_url": "http://localhost:11434",
  },
  "embeddings": {
    "model": "ollama/nomic-embed-text",
    "temperature": 0,
    # "base_url": "http://localhost:11434",
  },
  "verbose": true,
  "headless": false
}
```
54 changes: 54 additions & 0 deletions examples/extras/force_mode.py
```python
"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        # "format": "json",  # Ollama needs the format to be specified explicitly
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "force": True,
    "caching": True
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
32 changes: 32 additions & 0 deletions examples/extras/load_yml.py
```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import yaml
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************
with open("example.yml", 'r') as file:
    graph_config = yaml.safe_load(file)

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",
    source="https://sport.sky.it/nba?gr=www",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```