543 script creator graph only use first chunk #619

Conversation

tm-robinson (Contributor)

This is an initial fix for #543 when using ScriptCreatorGraph. Because ParseNode produces multiple chunks when scraping long HTML pages, I have updated ScriptCreatorGraph to use only the first chunk, on the basis that the structure of the rest of the page is likely similar, so the generated script should still work.
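
For reference, a minimal sketch of the idea (not the actual diff; the "parsed_doc" state key is an assumption about how chunks are passed between nodes):

def keep_first_chunk(state: dict) -> dict:
    # Hypothetical helper, not the actual patch. Assumes ParseNode stores
    # its output as a list of chunks under the "parsed_doc" state key.
    chunks = state.get("parsed_doc", [])
    if len(chunks) > 1:
        # Keep only the first chunk, assuming the rest of the page
        # shares the same structure.
        state["parsed_doc"] = chunks[:1]
    return state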

This seems to be working fine when using this scraper script:

from scrapegraphai.graphs import ScriptCreatorGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "sk-proj-xxxxxx",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
    "library": "beautifulsoup"
}

url = 'https://old.reddit.com/r/gaming/comments/1exyrjv/ah_monkey_business/'
script_creator_graph = ScriptCreatorGraph(
    prompt="Get ALL comment that appear on this page, including the contents of the comment, the author, "
    "the score, and any child comments. Return as a hierarchical data structure.",
    source=url,
    config=graph_config
)

result = script_creator_graph.run()

# Remove the first and last lines (likely the markdown fences around the LLM output)
cleaned_content = result.strip().split('\n')[1:-1]

# Join the cleaned lines back into a string
cleaned_content = '\n'.join(cleaned_content)
print(cleaned_content)
with open('output.py', 'w') as file:
    file.write(cleaned_content)
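
As an aside, slicing off the first and last lines assumes the LLM always wraps its answer in fences; a slightly more defensive variant (a hypothetical helper, not part of scrapegraphai) could check for the fence first:

import re

def strip_code_fences(text: str) -> str:
    # If the text is wrapped in ```...``` fences, return only the body;
    # otherwise return it unchanged so no real code is lost.
    match = re.match(r"^```\w*\n(.*)\n```$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()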

Running the ScriptCreatorGraph example produces the following generated scraper:

import requests
from bs4 import BeautifulSoup
import json

def main():
    url = "https://old.reddit.com/r/gaming/comments/1exyrjv/ah_monkey_business/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    comments_data = []

    for comment in soup.find_all('div', class_='thing'):
        comment_info = {}
        author = comment.find('a', class_='author')
        score = comment.find('span', class_='score')
        content = comment.find('div', class_='usertext-body')
        
        if author and score and content:
            comment_info['author'] = author.text
            comment_info['score'] = int(score['title'])
            comment_info['content'] = content.get_text(strip=True)
            comment_info['children'] = []

            # Check for child comments
            child_comments = comment.find_all('div', class_='child')
            for child in child_comments:
                child_info = {}
                child_author = child.find('a', class_='author')
                child_score = child.find('span', class_='score')
                child_content = child.find('div', class_='usertext-body')

                if child_author and child_score and child_content:
                    child_info['author'] = child_author.text
                    child_info['score'] = int(child_score['title'])
                    child_info['content'] = child_content.get_text(strip=True)
                    comment_info['children'].append(child_info)

            comments_data.append(comment_info)

    print(json.dumps(comments_data, indent=4))

if __name__ == "__main__":
    main()

The generated script seems to work fine. I think this solution is worth going with for now, since I suspect it will work most of the time. When I have some spare time, I will look into generating a script for each chunk and asking the LLM to merge the resulting scripts.
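
For the record, a rough outline of that follow-up idea (hypothetical; generate_script and merge_scripts stand in for LLM calls and are not library APIs):

def script_per_chunk(chunks, generate_script, merge_scripts):
    # generate_script(chunk) -> str and merge_scripts(scripts) -> str are
    # injected LLM calls; both are placeholders, not scrapegraphai APIs.
    scripts = [generate_script(chunk) for chunk in chunks]
    if len(scripts) == 1:
        return scripts[0]
    # Ask the LLM to reconcile the per-chunk scripts into one scraper.
    return merge_scripts(scripts)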

VinciGit00 and others added 4 commits on September 1, 2024:
fix: deepcopy fail for coping model_instance config
## [1.16.0](ScrapeGraphAI/Scrapegraph-ai@v1.15.2...v1.16.0) (2024-09-01)

### Features

* add deepcopy error ([71b22d4](ScrapeGraphAI@71b22d4))

### Bug Fixes

* deepcopy fail for coping model_instance config ([cd07418](ScrapeGraphAI@cd07418))
* fix pydantic object copy ([553527a](ScrapeGraphAI@553527a))
@VinciGit00 changed the base branch from main to pre/beta September 2, 2024 09:32
@VinciGit00 merged commit fd0a902 into ScrapeGraphAI:pre/beta Sep 2, 2024

github-actions bot commented Sep 2, 2024

🎉 This issue has been resolved in version 1.17.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@tm-robinson deleted the 543-ScriptCreatorGraph-only-use-first-chunk branch September 3, 2024 07:12
@tm-robinson restored the 543-ScriptCreatorGraph-only-use-first-chunk branch September 3, 2024 08:04

github-actions bot commented Sep 8, 2024

🎉 This PR is included in version 1.19.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀


🎉 This PR is included in version 1.20.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀


🎉 This PR is included in version 1.21.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
