alignment #363


Closed
wants to merge 29 commits into from

Changes from all commits (29 commits)
4559ab6
docs: add Japanese README
eltociear Jun 5, 2024
871e398
docs: update README.md
eltociear Jun 5, 2024
f0042a8
docs: update japanese.md
eltociear Jun 5, 2024
04b0352
Merge pull request #345 from eltociear/add-japanese-readme
VinciGit00 Jun 5, 2024
1d38ed1
fix: bug on generate_answer_node
VinciGit00 Jun 5, 2024
3629215
ci(release): 1.5.5 [skip ci]
semantic-release-bot Jun 5, 2024
4e16c9a
support ernie
duke147 Jun 5, 2024
67d83cf
fix: getter
VinciGit00 Jun 5, 2024
49cdadf
ci(release): 1.5.6 [skip ci]
semantic-release-bot Jun 5, 2024
2ef6d67
Merge pull request #346 from duke147/ernie
VinciGit00 Jun 5, 2024
2b2b910
support ernie
duke147 Jun 5, 2024
1a404e3
Merge remote-tracking branch 'upstream/main' into ernie
duke147 Jun 5, 2024
9572578
add ernie example
VinciGit00 Jun 5, 2024
9ef73d7
Merge pull request #347 from duke147/ernie
VinciGit00 Jun 5, 2024
d772453
Refactor model_name attribute access in llm_model in robots_node.py
tindo1234 Jun 5, 2024
e7af5ea
Merge pull request #348 from tindo2003/fix_robots_node
VinciGit00 Jun 5, 2024
10672d6
fix: update openai tts class
VinciGit00 Jun 6, 2024
c17daca
ci(release): 1.5.7 [skip ci]
semantic-release-bot Jun 6, 2024
d845a1b
test: Enhance JSON scraping pipeline test
tejhande Jun 7, 2024
261c4fc
Merge pull request #352 from tejhande/patch-1
VinciGit00 Jun 7, 2024
320f13f
Enhance tests for FetchNode with mocking
tejhande Jun 7, 2024
ff9df81
Test ScriptCreatorGraph and print execution info
tejhande Jun 7, 2024
c78aa43
beautify readmes
VinciGit00 Jun 8, 2024
5dc6165
add example
VinciGit00 Jun 9, 2024
14d1011
Merge pull request #354 from tejhande/patch-2
VinciGit00 Jun 9, 2024
dedfa2e
feat: Add tests for RobotsNode and update test setup
tejhande Jun 9, 2024
2781c3c
Merge pull request #355 from tejhande/patch-3
VinciGit00 Jun 9, 2024
e688480
Merge pull request #362 from tejhande/patch-4
VinciGit00 Jun 9, 2024
58086ee
ci(release): 1.6.0 [skip ci]
semantic-release-bot Jun 9, 2024
40 changes: 40 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,43 @@
## [1.6.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.7...v1.6.0) (2024-06-09)


### Features

* Add tests for RobotsNode and update test setup ([dedfa2e](https://github.com/VinciGit00/Scrapegraph-ai/commit/dedfa2eaf02b7e9b68a116515053c1daae6e4a31))


### Test

* Enhance JSON scraping pipeline test ([d845a1b](https://github.com/VinciGit00/Scrapegraph-ai/commit/d845a1ba7d6e7f7574b92b51b6d5326bbfb3d1c6))

## [1.5.7](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.6...v1.5.7) (2024-06-06)


### Bug Fixes

* update openai tts class ([10672d6](https://github.com/VinciGit00/Scrapegraph-ai/commit/10672d6ebb06d950bbf8b66cc9a2d420c183013d))

## [1.5.6](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.5...v1.5.6) (2024-06-05)


### Bug Fixes

* getter ([67d83cf](https://github.com/VinciGit00/Scrapegraph-ai/commit/67d83cff46d8ea606b8972c364ab4c56e6fa4fe4))

## [1.5.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.4...v1.5.5) (2024-06-05)


### Bug Fixes

* bug on generate_answer_node ([1d38ed1](https://github.com/VinciGit00/Scrapegraph-ai/commit/1d38ed146afae95dae1f35ac51180a1882bf8a29))


### Docs

* add Japanese README ([4559ab6](https://github.com/VinciGit00/Scrapegraph-ai/commit/4559ab6db845a0d94371a09d0ed1e1623eed9ee2))
* update japanese.md ([f0042a8](https://github.com/VinciGit00/Scrapegraph-ai/commit/f0042a8e33f8fb8b113681ee0a9995d329bb0faa))
* update README.md ([871e398](https://github.com/VinciGit00/Scrapegraph-ai/commit/871e398a26786d264dbd1b2743864ed2cc12b3da))

## [1.5.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.3...v1.5.4) (2024-05-31)


2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@

# 🕷️ ScrapeGraphAI: You Only Scrape Once
[English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中文](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md)
[English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中文](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md) | [日本語](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/japanese.md)

[![Downloads](https://static.pepy.tech/badge/scrapegraphai)](https://pepy.tech/project/scrapegraphai)
[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)
10 changes: 5 additions & 5 deletions docs/chinese.md
@@ -1,9 +1,9 @@
# 🕷️ ScrapeGraphAI: You Only Scrape Once
[![下载量](https://static.pepy.tech/badge/scrapegraphai)](https://pepy.tech/project/scrapegraphai)
[![代码检查: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)
[![Pylint](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
[![CodeQL](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
[![许可证: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Downloads](https://img.shields.io/pepy/dt/scrapegraphai?style=for-the-badge)](https://pepy.tech/project/scrapegraphai)
[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen?style=for-the-badge)](https://github.com/pylint-dev/pylint)
[![Pylint](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/pylint.yml?style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
[![CodeQL](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/codeql.yml?style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
[![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)

ScrapeGraphAI is a *web scraping* Python library that uses large language models and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
225 changes: 225 additions & 0 deletions docs/japanese.md
@@ -0,0 +1,225 @@
# 🕷️ ScrapeGraphAI: You Only Scrape Once
[![Downloads](https://img.shields.io/pepy/dt/scrapegraphai?style=for-the-badge)](https://pepy.tech/project/scrapegraphai)
[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen?style=for-the-badge)](https://github.com/pylint-dev/pylint)
[![Pylint](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/pylint.yml?style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
[![CodeQL](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/codeql.yml?style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
[![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)

ScrapeGraphAI is a *web scraping* Python library that uses large language models and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).

Just tell the library which information you want to extract and it will do the rest for you!

<p align="center">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
</p>

## 🚀 Quick Install

The reference page for Scrapegraph-ai is available on the official page of PyPI: [pypi](https://pypi.org/project/scrapegraphai/).

```bash
pip install scrapegraphai
```
**Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱

## 🔍 Demo

Official Streamlit demo:

[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app)

Try it directly on the web using Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)

## 📖 Documentation

The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).

Check out also the Docusaurus [version](https://scrapegraph-doc.onrender.com/).

## 💻 Usage

There are four main scraping pipelines that can be used to extract information from a website (or local file):

- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source.
- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine.
- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
- `SmartScraperMultiGraph`: multi-page scraper that, given a single prompt and a list of sources, extracts information from several pages (see the sketch right after this list).

It is possible to use different LLMs through APIs, such as **OpenAI**, **Groq**, **Azure**, and **Gemini**, or local models using **Ollama**.
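
The multi-page pipeline has no dedicated example below, so here is a minimal sketch of how it could be invoked. It assumes `SmartScraperMultiGraph` accepts the same configuration format as `SmartScraperGraph` and a list of URLs as its `source`; the exact signature may differ between versions, so treat this as illustrative rather than authoritative.

```python
from scrapegraphai.graphs import SmartScraperMultiGraph

# Same configuration format as the single-page examples below
# (an OpenAI model is used here purely for brevity).
graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",
        "model": "gpt-3.5-turbo",
    },
}

# One prompt, several sources: each page is scraped and the answers are merged.
multi_graph = SmartScraperMultiGraph(
    prompt="List me all the projects with their descriptions",
    source=[
        "https://perinim.github.io/projects",
        "https://perinim.github.io/",
    ],
    config=graph_config,
)

result = multi_graph.run()
print(result)
```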

### Example 1: SmartScraper using local models
Remember to have [Ollama](https://ollama.com/) installed and download the models using the `ollama pull` command.

``` python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set the Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set the Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
```

The output will be a list of projects with their descriptions, like the following:

```python
{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
```

### Example 2: SearchGraph using mixed models
We use **Groq** for the LLM and **Ollama** for the embeddings.

```python
from scrapegraphai.graphs import SearchGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set the Ollama URL arbitrarily
    },
    "max_results": 5,
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

# Run the graph
result = search_graph.run()
print(result)
```

The output will be a list of recipes, like the following:

```python
{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
```

### Example 3: SpeechGraph using OpenAI

You just need to pass the OpenAI API key and the model names.

```python
from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "audio_summary.mp3",
}

# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)
```
The output will be an audio file with a summary of the projects on the page.

## Sponsors

<div style="text-align: center;">
<a href="https://serpapi.com?utm_source=scrapegraphai">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
</a>
<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
</a>
</div>

## 🤝 Contributing

Feel free to contribute and join our Discord server to discuss improvements and give suggestions!

Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).

[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)


## 📈 Roadmap

Check out the project roadmap [here](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/README.md)! 🚀

Would you like to visualize the roadmap in a more interactive way? Copy the markdown content into the [markmap](https://markmap.js.org/repl) editor to visualize it!

## ❤️ Contributors
[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)


## 🎓 Citations

If you have used our library for research purposes, please cite us with the following reference:
```text
@misc{scrapegraph-ai,
author = {Marco Perini and Lorenzo Padoan and Marco Vinciguerra},
title = {Scrapegraph-ai},
year = {2024},
url = {https://github.com/VinciGit00/Scrapegraph-ai},
note = {A Python library for scraping leveraging large language models}
}
```
## Authors

<p align="center">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors_logos">
</p>

## Contacts
|                    | Contact              |
|--------------------|----------------------|
| Marco Vinciguerra | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) |
| Marco Perini | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/perinim/) |
| Lorenzo Padoan | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) |

## 📜 License

ScrapeGraphAI is licensed under the MIT License. See the [LICENSE](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/LICENSE) file for more information.

## Acknowledgements

- We would like to thank all the contributors to the project and the open-source community for their support.
- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.
18 changes: 0 additions & 18 deletions examples/anthropic/pdf_scraper_graph_haiku.py
@@ -28,28 +28,10 @@
the Beatrice of his earlier poetry, through the celestial spheres of Paradise.
"""

schema = """
{
"type": "object",
"properties": {
"summary": {
"type": "string"
},
"topics": {
"type": "array",
"items": {
"type": "string"
}
}
}
}
"""

pdf_scraper_graph = PDFScraperGraph(
    prompt="Summarize the text and find the main topics",
    source=source,
    config=graph_config,
    schema=schema,
)
result = pdf_scraper_graph.run()

1 change: 0 additions & 1 deletion examples/anthropic/smart_scraper_haiku.py
@@ -9,7 +9,6 @@


# required environment variables in .env
# HUGGINGFACEHUB_API_TOKEN
# ANTHROPIC_API_KEY
load_dotenv()

61 changes: 61 additions & 0 deletions examples/ernie/csv_scraper_ernie.py
@@ -0,0 +1,61 @@
"""
Basic example of scraping pipeline using CSVScraperGraph from CSV documents
"""

import os
from dotenv import load_dotenv
import pandas as pd
from scrapegraphai.graphs import CSVScraperGraph
from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info
load_dotenv()

# ************************************************
# Read the CSV file
# ************************************************

FILE_NAME = "inputs/username.csv"
curr_dir = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(curr_dir, FILE_NAME)

text = pd.read_csv(file_path)

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "model": "ernie-bot-turbo",
        "ernie_client_id": "<ernie_client_id>",
        "ernie_client_secret": "<ernie_client_secret>",
        "temperature": 0.1
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    }
}

# ************************************************
# Create the CSVScraperGraph instance and run it
# ************************************************

csv_scraper_graph = CSVScraperGraph(
    prompt="List me all the last names",
    source=str(text),  # Pass the content of the file, not the file object
    config=graph_config
)

result = csv_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = csv_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

# Save to json or csv
convert_to_csv(result, "result")
convert_to_json(result, "result")