feat: add integrations for markdown files #417


Merged: 21 commits, Jul 4, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -12,7 +12,7 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
[![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)

ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).

Just say which information you want to extract and the library will do it for you!

37 changes: 19 additions & 18 deletions examples/benchmarks/SmartScraper/Readme.md
@@ -1,16 +1,17 @@
# Local models
The two benchmark websites are:
- Example 1: https://perinim.github.io/projects
- Example 2: https://www.wired.com (at 17/4/2024)

Both are stored locally as .txt files so that no internet connection is needed to run the benchmarks.

| Hardware | Model | Example 1 | Example 2 |
| ------------------ | --------------------------------------- | --------- | --------- |
| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s | 26.61s |
| Macbook m2 max | Mistral on Ollama with nomic-embed-text | 8.05s | 12.17s |
| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text | 29.87s | 35.32s |
| Macbook m2 max | Llama3 on Ollama with nomic-embed-text | 18.36s | 78.32s |
| Hardware               | Model                                   | Example 1 | Example 2 |
| ---------------------- | --------------------------------------- | --------- | --------- |
| Macbook 14' m1 pro     | Mistral on Ollama with nomic-embed-text | 16.291s   | 38.74s    |
| Macbook m2 max         | Mistral on Ollama with nomic-embed-text |           |           |
| Macbook 14' m1 pro     | Llama3 on Ollama with nomic-embed-text  | 12.88s    | 13.84s    |
| Macbook m2 max         | Llama3 on Ollama with nomic-embed-text  |           |           |

**Note**: the Docker examples were not run on devices other than the Macbook because performance was too slow (about 10 times slower than Ollama). The results are the following:

@@ -22,20 +23,20 @@
**URL**: https://perinim.github.io/projects
**Task**: List me all the projects with their description.

| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 25.22 | 445 | 272 | 173 | 1 | 0.000754 |
| gpt-4-turbo-preview | 9.53 | 449 | 272 | 177 | 1 | 0.00803 |
| Grooq with nomic-embed-text | 1.99 | 474 | 284 | 190 | 1 | 0 |
| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 4.132s | 438 | 303 | 135 | 1 | 0.000724 |
| gpt-4-turbo-preview | 6.965s | 442 | 303 | 139 | 1 | 0.0072 |
| gpt-4-o | 4.446s | 444 | 305 | 139 | 1 | 0 |
| Groq with nomic-embed-text      | 1.335s                   | 648          | 482           | 166               | 1                   | 0              |

### Example 2: Wired
**URL**: https://www.wired.com
**Task**: List me all the articles with their description.

| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 25.89 | 445 | 272 | 173 | 1 | 0.000754 |
| gpt-4-turbo-preview | 64.70 | 3573 | 2199 | 1374 | 1 | 0.06321 |
| Grooq with nomic-embed-text | 3.82 | 2459 | 2192 | 267 | 1 | 0 |


| Name | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
| gpt-3.5-turbo | 8.836s | 1167 | 726 | 441 | 1 | 0.001971 |
| gpt-4-turbo-preview | 21.53s | 1205 | 726 | 479 | 1 | 0.02163 |
| gpt-4-o | 15.27s | 1400 | 715 | 685 | 1 | 0 |
| Groq with nomic-embed-text      | 3.82s                    | 2459         | 2192          | 267               | 1                   | 0              |
53 changes: 53 additions & 0 deletions examples/benchmarks/SmartScraper/benchmark_openai_gpt4o.py
@@ -0,0 +1,53 @@
"""
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
         "List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
    with open(files[i], 'r', encoding="utf-8") as file:
        text = file.read()

    smart_scraper_graph = SmartScraperGraph(
        prompt=tasks[i],
        source=text,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)
    # ************************************************
    # Get graph execution info
    # ************************************************

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))
15 changes: 15 additions & 0 deletions examples/extras/example.yml
@@ -0,0 +1,15 @@
{
  "llm": {
    "model": "ollama/llama3",
    "temperature": 0,
    "format": "json",
    # "base_url": "http://localhost:11434",
  },
  "embeddings": {
    "model": "ollama/nomic-embed-text",
    "temperature": 0,
    # "base_url": "http://localhost:11434",
  },
  "verbose": true,
  "headless": false
}
54 changes: 54 additions & 0 deletions examples/extras/force_mode.py
@@ -0,0 +1,54 @@
"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        # "format": "json",  # Ollama needs the format to be specified explicitly
        # "base_url": "http://localhost:11434",  # set the Ollama URL if needed
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        # "base_url": "http://localhost:11434",  # set the Ollama URL if needed
    },
    "force": True,
    "caching": True
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
32 changes: 32 additions & 0 deletions examples/extras/load_yml.py
@@ -0,0 +1,32 @@
"""
Basic example of scraping pipeline using SmartScraper
"""
import yaml
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************
with open("example.yml", 'r') as file:
    graph_config = yaml.safe_load(file)

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",
    source="https://sport.sky.it/nba?gr=www",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
43 changes: 43 additions & 0 deletions examples/extras/no_cut.py
@@ -0,0 +1,43 @@
"""
This example shows how to skip processing of the HTML code in the fetch phase
"""

import os, json
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info


# ************************************************
# Define the configuration for the graph
# ************************************************


graph_config = {
    "llm": {
        "api_key": "s",
        "model": "gpt-3.5-turbo",
    },
    "cut": False,
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the python code inside the page",
    source="https://www.exploit-db.com/exploits/51447",
    config=graph_config
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
48 changes: 48 additions & 0 deletions examples/extras/proxy_rotation.py
@@ -0,0 +1,48 @@
"""
Basic example of scraping pipeline using SmartScraper
"""

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info


# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": "API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "loader_kwargs": {
        "proxy": {
            "server": "http:/**********",
            "username": "********",
            "password": "***",
        },
    },
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
46 changes: 46 additions & 0 deletions examples/extras/rag_caching.py
@@ -0,0 +1,46 @@
"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "caching": True
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))