Pre/beta #438


Merged: 51 commits, merged on Jul 9, 2024.
- `40b4b2d` add new models tokens (VinciGit00, Jun 19, 2024)
- `79a2f51` add new models tokens (VinciGit00, Jun 19, 2024)
- `8bb560a` add convert function (VinciGit00, Jun 19, 2024)
- `6d78375` add benchmark (VinciGit00, Jun 19, 2024)
- `23bc633` fixed a bug (VinciGit00, Jun 19, 2024)
- `5664eb2` Update generate_answer_node_prompts.py (VinciGit00, Jun 20, 2024)
- `2f02830` refactoring of fetch node (VinciGit00, Jun 20, 2024)
- `5d61238` add new convert function (VinciGit00, Jun 20, 2024)
- `7af411a` add trigger (VinciGit00, Jun 21, 2024)
- `d1c3de7` fixed a bug (VinciGit00, Jun 21, 2024)
- `cf9a3d1` add test (VinciGit00, Jun 21, 2024)
- `6549915` Update Readme.md (VinciGit00, Jun 21, 2024)
- `afd46ac` fixed generate_answer_node (VinciGit00, Jun 22, 2024)
- `d8fcb6c` add new examples (VinciGit00, Jun 22, 2024)
- `9917972` fixed request (VinciGit00, Jun 22, 2024)
- `92cabe1` add load examples from a yml file (VinciGit00, Jun 23, 2024)
- `3a537ee` fix: add test (VinciGit00, Jun 23, 2024)
- `df0e310` feat: add fireworks integration (VinciGit00, Jun 24, 2024)
- `4b56604` add examples + test (VinciGit00, Jun 25, 2024)
- `228a1de` add new force (VinciGit00, Jun 27, 2024)
- `9b45ebc` modify fetch node with no cut mode (VinciGit00, Jun 28, 2024)
- `2804434` feat: add integrations for markdown files (VinciGit00, Jun 29, 2024)
- `30ca15c` Merge branch 'md_scraper_integration' into integration_markdown (VinciGit00, Jun 30, 2024)
- `2242b10` Merge pull request #419 from VinciGit00/integration_markdown (VinciGit00, Jun 30, 2024)
- `5fe694b` feat: improve md prompt recognition (VinciGit00, Jun 30, 2024)
- `119514b` feat: add vertexai integration (VinciGit00, Jul 1, 2024)
- `f3b6343` add new info (VinciGit00, Jul 1, 2024)
- `e3a19c2` Merge pull request #428 from VinciGit00/integration_markdown (VinciGit00, Jul 1, 2024)
- `3bf5f57` feat: add integration for infos (VinciGit00, Jul 1, 2024)
- `ed2af51` update the chunk size (VinciGit00, Jul 2, 2024)
- `d419b0a` Update docker-compose.yml (VinciGit00, Jul 2, 2024)
- `3ee1743` update prompts (VinciGit00, Jul 4, 2024)
- `583c321` chore(CI): fix pylint workflow (f-aguzzi, Jul 4, 2024)
- `afeb81f` chore(Docker): fix port number (f-aguzzi, Jul 4, 2024)
- `2ab7ddc` Merge pull request #405 from ScrapeGraphAI/integration_markdown (f-aguzzi, Jul 4, 2024)
- `720f187` Merge branch 'fireworks_integration' into support (f-aguzzi, Jul 4, 2024)
- `27c2dd2` chore(rye): rebuild lockfiles (f-aguzzi, Jul 4, 2024)
- `d77a622` Merge branch '423-add-vertex-ai-integration' into support (f-aguzzi, Jul 4, 2024)
- `591cab0` add new env (VinciGit00, Jul 4, 2024)
- `8f4a13b` Merge pull request #437 from ScrapeGraphAI/main (VinciGit00, Jul 4, 2024)
- `1bbd25a` Merge pull request #407 from ScrapeGraphAI/404-split-unit-testing-fro… (f-aguzzi, Jul 4, 2024)
- `8f9f96f` ci(release): 1.8.1-beta.1 [skip ci] (semantic-release-bot, Jul 4, 2024)
- `104d869` Merge branch 'pre/beta' into support (f-aguzzi, Jul 4, 2024)
- `fd6142e` Merge pull request #436 from ScrapeGraphAI/support (VinciGit00, Jul 4, 2024)
- `8a52914` Merge pull request #429 from ScrapeGraphAI/421-default-prompt-templat… (VinciGit00, Jul 4, 2024)
- `765b548` Merge pull request #417 from ScrapeGraphAI/md_scraper_integration (VinciGit00, Jul 4, 2024)
- `146432d` ci(release): 1.9.0-beta.1 [skip ci] (semantic-release-bot, Jul 4, 2024)
- `ba782a6` add compatibility for versions (VinciGit00, Jul 4, 2024)
- `ff80cbb` Merge branch 'pre/beta' of https://github.com/ScrapeGraphAI/Scrapegra… (VinciGit00, Jul 4, 2024)
- `7570bf8` fix: fix pyproject.toml (VinciGit00, Jul 5, 2024)
- `5cb5fbf` ci(release): 1.9.0-beta.2 [skip ci] (semantic-release-bot, Jul 5, 2024)
26 changes: 11 additions & 15 deletions .github/workflows/pylint.yml
```diff
@@ -1,30 +1,26 @@
-on: [push]
+on:
+  push:
+    paths:
+      - 'scrapegraphai/**'
+      - '.github/workflows/pylint.yml'
 
 jobs:
   build:
     runs-on: ubuntu-latest
     strategy:
       matrix:
         python-version: ["3.10"]
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}
+      - name: Install the latest version of rye
+        uses: eifinger/setup-rye@v3
       - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install pylint
-          pip install -r requirements.txt
+        run: rye sync --no-lock
       - name: Analysing the code with pylint
-        run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py
+        run: rye run pylint-ci
       - name: Check Pylint score
         run: |
-          pylint_score=$(pylint --disable=all --enable=metrics --output-format=text scrapegraphai/**/*.py scrapegraphai/*.py | grep 'Raw metrics' | awk '{print $4}')
+          pylint_score=$(rye run pylint-score-ci | grep 'Raw metrics' | awk '{print $4}')
           if (( $(echo "$pylint_score < 8" | bc -l) )); then
             echo "Pylint score is below 8. Blocking commit."
             exit 1
           else
             echo "Pylint score is acceptable."
           fi
```
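The score gate in this workflow pipes pylint output through `grep` and `awk` and compares the result with `bc`. A minimal Python sketch of the same logic, handy for checking the pipeline locally; the sample "Raw metrics" line is a hypothetical stand-in, since the exact pylint output format is not shown in this diff:

```python
def extract_score(pylint_output: str) -> float:
    """Equivalent of `grep 'Raw metrics' | awk '{print $4}'`:
    find the matching line and take its fourth whitespace-separated field."""
    for line in pylint_output.splitlines():
        if "Raw metrics" in line:
            return float(line.split()[3])
    raise ValueError("no 'Raw metrics' line found")


def score_gate(score: float, threshold: float = 8.0) -> bool:
    """Equivalent of the bc comparison: True means the commit passes."""
    return score >= threshold


# Hypothetical output line; the real pylint format may differ.
sample = "Raw metrics score 9.12 statements"
print(score_gate(extract_score(sample)))  # True: 9.12 clears the 8.0 bar
```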
38 changes: 38 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,41 @@
## [1.9.0-beta.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.9.0-beta.1...v1.9.0-beta.2) (2024-07-05)


### Bug Fixes

* fix pyproject.toml ([7570bf8](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7570bf8294e49bc54ec9e296aaadb763873390ca))

## [1.9.0-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.8.1-beta.1...v1.9.0-beta.1) (2024-07-04)


### Features

* add fireworks integration ([df0e310](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/df0e3108299071b849d7e055bd11d72764d24f08))
* add integration for infos ([3bf5f57](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3bf5f570a8f8e1b037a7ad3c9f583261a1536421))
* add integrations for markdown files ([2804434](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/2804434a9ee12c52ae8956a88b1778a4dd3ec32f))
* add vertexai integration ([119514b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/119514bdfc2a16dfb8918b0c34ae7cc43a01384c))
* improve md prompt recognition ([5fe694b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5fe694b6b4545a5091d16110318b992acfca4f58))


### chore

* **Docker:** fix port number ([afeb81f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/afeb81f77a884799192d79dcac85666190fb1c9d))
* **CI:** fix pylint workflow ([583c321](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/583c32106e827f50235d8fc69511652fd4b07a35))
* **rye:** rebuild lockfiles ([27c2dd2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/27c2dd23517a7e4b14fafd00320a8b81f73145dc))

## [1.8.1-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.8.0...v1.8.1-beta.1) (2024-07-04)


### Bug Fixes

* add test ([3a537ee](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3a537eec6fef1743924a9aa5cef0ba2f8d44bf11))


### Docs

* **roadmap:** fix urls ([14faba4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/14faba4f00dd9f947f8dc5e0b51be49ea684179f))
* **roadmap:** next steps ([3e644f4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3e644f498f05eb505fbd4e94b144c81567569aaa))

## [1.8.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.7.5...v1.8.0) (2024-06-30)


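The feature entries above (fireworks, VertexAI, markdown) all surface through the library's `graph_config` dictionary, as the example scripts in this PR show. A hedged sketch of how a backend might be selected from a provider-prefixed model string; the helper, the default fallback, and the model identifiers are illustrative assumptions, not API taken from the PR:

```python
def provider_of(model: str) -> str:
    """Infer the backend from a provider-prefixed model string
    (the 'provider/model' convention and the default are assumptions)."""
    return model.split("/", 1)[0] if "/" in model else "openai"


# Illustrative configs; exact model identifiers are assumptions.
fireworks_config = {"llm": {"api_key": "<FIREWORKS_API_KEY>",
                            "model": "fireworks/mixtral-8x7b-instruct"}}
vertexai_config = {"llm": {"model": "vertexai/gemini-pro"}}

print(provider_of(fireworks_config["llm"]["model"]))  # fireworks
print(provider_of(vertexai_config["llm"]["model"]))   # vertexai
```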
2 changes: 1 addition & 1 deletion README.md
```diff
@@ -12,7 +12,7 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).
 
 Just say which information you want to extract and the library will do it for you!
```
2 changes: 1 addition & 1 deletion docker-compose.yml
```diff
@@ -4,7 +4,7 @@ services:
     image: ollama/ollama
     container_name: ollama
     ports:
-      - "5000:5000"
+      - "11434:11434"
     volumes:
       - ollama_volume:/root/.ollama
     restart: unless-stopped
```
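This port fix matters because Ollama's HTTP API listens on 11434 by default, not 5000. A small sketch for verifying that the remapped container is reachable; the endpoint URL is assumed to be Ollama's default:

```python
import urllib.request
from urllib.error import URLError

OLLAMA_URL = "http://localhost:11434"  # default Ollama port, matching the compose fix


def ollama_reachable(url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if the Ollama endpoint answers an HTTP request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```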
37 changes: 19 additions & 18 deletions examples/benchmarks/SmartScraper/Readme.md
```diff
@@ -1,16 +1,17 @@
-# Local models
+# Local models
 The two websites benchmark are:
 - Example 1: https://perinim.github.io/projects
 - Example 2: https://www.wired.com (at 17/4/2024)
 
 Both are strored locally as txt file in .txt format because in this way we do not have to think about the internet connection
 
-| Hardware           | Model                                   | Example 1 | Example 2 |
-| ------------------ | --------------------------------------- | --------- | --------- |
-| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s    | 26.61s    |
-| Macbook m2 max     | Mistral on Ollama with nomic-embed-text | 8.05s     | 12.17s    |
-| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text  | 29.87s    | 35.32s    |
-| Macbook m2 max     | Llama3 on Ollama with nomic-embed-text  | 18.36s    | 78.32s    |
+| Hardware               | Model                                   | Example 1 | Example 2 |
+| ---------------------- | --------------------------------------- | --------- | --------- |
+| Macbook 14' m1 pro     | Mistral on Ollama with nomic-embed-text | 16.291s   | 38.74s    |
+| Macbook m2 max         | Mistral on Ollama with nomic-embed-text |           |           |
+| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text  | 12.88s    | 13.84s    |
+| Macbook m2 max<br>     | Llama3 on Ollama with nomic-embed-text  |           |           |
 
 **Note**: the examples on Docker are not runned on other devices than the Macbook because the performance are to slow (10 times slower than Ollama). Indeed the results are the following:
 
@@ -22,20 +23,20 @@
 **URL**: https://perinim.github.io/projects
 **Task**: List me all the projects with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.22                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 9.53                     | 449          | 272           | 177               | 1                   | 0.00803        |
-| Grooq with nomic-embed-text | 1.99                     | 474          | 284           | 190               | 1                   | 0              |
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 4.132s                   | 438          | 303           | 135               | 1                   | 0.000724       |
+| gpt-4-turbo-preview             | 6.965s                   | 442          | 303           | 139               | 1                   | 0.0072         |
+| gpt-4-o                         | 4.446s                   | 444          | 305           | 139               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 1.335s                   | 648          | 482           | 166               | 1                   | 0              |
 
 ### Example 2: Wired
 **URL**: https://www.wired.com
 **Task**: List me all the articles with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.89                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 64.70                    | 3573         | 2199          | 1374              | 1                   | 0.06321        |
-| Grooq with nomic-embed-text | 3.82                     | 2459         | 2192          | 267               | 1                   | 0              |
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 8.836s                   | 1167         | 726           | 441               | 1                   | 0.001971       |
+| gpt-4-turbo-preview             | 21.53s                   | 1205         | 726           | 479               | 1                   | 0.02163        |
+| gpt-4-o                         | 15.27s                   | 1400         | 715           | 685               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 3.82s                    | 2459         | 2192          | 267               | 1                   | 0              |
```
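The execution times in the tables above come from wall-clock measurements around each graph run. A minimal timing wrapper sketch showing only the measurement pattern, not the library calls themselves:

```python
import time


def timed_run(run_fn, label: str):
    """Time a zero-argument callable and report it like the benchmark tables."""
    start = time.perf_counter()
    result = run_fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# With a real graph this would be, e.g.:
# timed_run(smart_scraper_graph.run, "gpt-4o / Example 1")
```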
53 changes: 53 additions & 0 deletions examples/benchmarks/SmartScraper/benchmark_openai_gpt4o.py
```python
"""
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
         "List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
    with open(files[i], 'r', encoding="utf-8") as file:
        text = file.read()

    smart_scraper_graph = SmartScraperGraph(
        prompt=tasks[i],
        source=text,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

    # ************************************************
    # Get graph execution info
    # ************************************************

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))
```
50 changes: 50 additions & 0 deletions examples/extras/custom_prompt.py
```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import os
import json
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

prompt = "Some more info"

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "additional_info": prompt,
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
15 changes: 15 additions & 0 deletions examples/extras/example.yml
```yaml
{
  "llm": {
    "model": "ollama/llama3",
    "temperature": 0,
    "format": "json",
    # "base_url": "http://localhost:11434",
  },
  "embeddings": {
    "model": "ollama/nomic-embed-text",
    "temperature": 0,
    # "base_url": "http://localhost:11434",
  },
  "verbose": true,
  "headless": false
}
```
54 changes: 54 additions & 0 deletions examples/extras/force_mode.py
```python
"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        # "format": "json",  # Ollama needs the format to be specified explicitly
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "force": True,
    "caching": True
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
32 changes: 32 additions & 0 deletions examples/extras/load_yml.py
```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import yaml
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************
with open("example.yml", 'r') as file:
    graph_config = yaml.safe_load(file)

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",
    source="https://sport.sky.it/nba?gr=www",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```