543 script creator graph only use first chunk #619

Conversation

tm-robinson (Contributor)

This is an initial fix for #543 when using ScriptCreatorGraph. Because ParseNode produces multiple chunks when scraping long HTML pages, I have updated ScriptCreatorGraph to use only the first chunk, on the basis that the structure of the rest of the page is likely similar, so the generated script should still work.
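
For reference, a minimal sketch of the idea (not the actual diff; the "parsed_doc" state key is an assumption about how chunks are passed between nodes):

def keep_first_chunk(state: dict) -> dict:
    # Hypothetical helper, not the actual patch. Assumes ParseNode stores
    # its output as a list of chunks under the "parsed_doc" state key.
    chunks = state.get("parsed_doc", [])
    if len(chunks) > 1:
        # Keep only the first chunk, assuming the rest of the page
        # shares the same structure.
        state["parsed_doc"] = chunks[:1]
    return state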

This seems to be working fine when using this scraper script:

from scrapegraphai.graphs import ScriptCreatorGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "sk-proj-xxxxxx",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
    "library": "beautifulsoup"
}

url = 'https://old.reddit.com/r/gaming/comments/1exyrjv/ah_monkey_business/'
script_creator_graph = ScriptCreatorGraph(
    prompt="Get ALL comment that appear on this page, including the contents of the comment, the author, "
    "the score, and any child comments. Return as a hierarchical data structure.",
    source=url,
    config=graph_config
)

result = script_creator_graph.run()

# Remove the first and last lines (likely the markdown fences around the LLM output)
cleaned_content = result.strip().split('\n')[1:-1]

# Join the cleaned lines back into a string
cleaned_content = '\n'.join(cleaned_content)
print(cleaned_content)
with open('output.py', 'w') as file:
    file.write(cleaned_content)
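
As an aside, slicing off the first and last lines assumes the LLM always wraps its answer in fences; a slightly more defensive variant (a hypothetical helper, not part of scrapegraphai) could check for the fence first:

import re

def strip_code_fences(text: str) -> str:
    # If the text is wrapped in ```...``` fences, return only the body;
    # otherwise return it unchanged so no real code is lost.
    match = re.match(r"^```\w*\n(.*)\n```$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()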

Running the ScriptCreatorGraph example produces the following generated scraper:

import requests
from bs4 import BeautifulSoup
import json

def main():
    url = "https://old.reddit.com/r/gaming/comments/1exyrjv/ah_monkey_business/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    comments_data = []

    for comment in soup.find_all('div', class_='thing'):
        comment_info = {}
        author = comment.find('a', class_='author')
        score = comment.find('span', class_='score')
        content = comment.find('div', class_='usertext-body')
        
        if author and score and content:
            comment_info['author'] = author.text
            comment_info['score'] = int(score['title'])
            comment_info['content'] = content.get_text(strip=True)
            comment_info['children'] = []

            # Check for child comments
            child_comments = comment.find_all('div', class_='child')
            for child in child_comments:
                child_info = {}
                child_author = child.find('a', class_='author')
                child_score = child.find('span', class_='score')
                child_content = child.find('div', class_='usertext-body')

                if child_author and child_score and child_content:
                    child_info['author'] = child_author.text
                    child_info['score'] = int(child_score['title'])
                    child_info['content'] = child_content.get_text(strip=True)
                    comment_info['children'].append(child_info)

            comments_data.append(comment_info)

    print(json.dumps(comments_data, indent=4))

if __name__ == "__main__":
    main()

The generated script seems to work fine. I think this solution is worth going with for now, since I suspect it will work most of the time. When I have some spare time, I will look into generating a script for each chunk and asking the LLM to merge the resulting scripts.
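
For the record, a rough outline of that follow-up idea (hypothetical; generate_script and merge_scripts stand in for LLM calls and are not library APIs):

def script_per_chunk(chunks, generate_script, merge_scripts):
    # generate_script(chunk) -> str and merge_scripts(scripts) -> str are
    # injected LLM calls; both are placeholders, not scrapegraphai APIs.
    scripts = [generate_script(chunk) for chunk in chunks]
    if len(scripts) == 1:
        return scripts[0]
    # Ask the LLM to reconcile the per-chunk scripts into one scraper.
    return merge_scripts(scripts)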

VinciGit00 and others added 4 commits on September 1, 2024:
fix: deepcopy fail for coping model_instance config
## [1.16.0](ScrapeGraphAI/Scrapegraph-ai@v1.15.2...v1.16.0) (2024-09-01)

### Features

* add deepcopy error ([71b22d4](ScrapeGraphAI@71b22d4))

### Bug Fixes

* deepcopy fail for coping model_instance config ([cd07418](ScrapeGraphAI@cd07418))
* fix pydantic object copy ([553527a](ScrapeGraphAI@553527a))
@VinciGit00 changed the base branch from main to pre/beta September 2, 2024 09:32
@VinciGit00 merged commit fd0a902 into ScrapeGraphAI:pre/beta Sep 2, 2024

github-actions bot commented Sep 2, 2024

🎉 This issue has been resolved in version 1.17.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@tm-robinson deleted the 543-ScriptCreatorGraph-only-use-first-chunk branch September 3, 2024 07:12
@tm-robinson restored the 543-ScriptCreatorGraph-only-use-first-chunk branch September 3, 2024 08:04

github-actions bot commented Sep 8, 2024

🎉 This PR is included in version 1.19.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀


🎉 This PR is included in version 1.20.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀


🎉 This PR is included in version 1.21.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
