From 65add5f46c0dadc09abcb2467029dbd094d1c610 Mon Sep 17 00:00:00 2001 From: semantic-release-bot Date: Mon, 4 Nov 2024 08:14:43 +0000 Subject: [PATCH 1/3] ci(release): 1.29.0 [skip ci] ## [1.29.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.28.0...v1.29.0) (2024-11-04) ### Features * Serper API integration for Google search ([c218546](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/c218546a3ddbdf987888e150942a244856af66cc)) ### Bug Fixes * resolved outparser issue ([e8cabfd](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/e8cabfd1ae7cc93abc04745948db1f6933fd2e26)) ### CI * **release:** 1.28.0-beta.3 [skip ci] ([65d39bb](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/65d39bbaf0671fa5ac84705e94adb42078a36c3b)) * **release:** 1.28.0-beta.4 [skip ci] ([b90bb00](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/b90bb00beb8497b8dd16fa4d1ef5af22042a55f3)) * **release:** 1.29.0-beta.1 [skip ci] ([950e859](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/950e859b1b90c7d5b85cbfcb0948e93d4487f78d)) --- CHANGELOG.md | 19 +++++++++++++++++++ pyproject.toml | 2 +- 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0de76f18..3cba3b99 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,22 @@ +## [1.29.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.28.0...v1.29.0) (2024-11-04) + + +### Features + +* Serper API integration for Google search ([c218546](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/c218546a3ddbdf987888e150942a244856af66cc)) + + +### Bug Fixes + +* resolved outparser issue ([e8cabfd](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/e8cabfd1ae7cc93abc04745948db1f6933fd2e26)) + + +### CI + +* **release:** 1.28.0-beta.3 [skip ci] ([65d39bb](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/65d39bbaf0671fa5ac84705e94adb42078a36c3b)) +* **release:** 1.28.0-beta.4 [skip ci] 
([b90bb00](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/b90bb00beb8497b8dd16fa4d1ef5af22042a55f3)) +* **release:** 1.29.0-beta.1 [skip ci] ([950e859](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/950e859b1b90c7d5b85cbfcb0948e93d4487f78d)) + ## [1.29.0-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.28.0...v1.29.0-beta.1) (2024-11-04) diff --git a/pyproject.toml b/pyproject.toml index 88fed28e..49158ab5 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -2,7 +2,7 @@ name = "scrapegraphai" -version = "1.29.0b1" +version = "1.29.0" From 60f673dc39cba70706291e11211b9ad180860e24 Mon Sep 17 00:00:00 2001 From: aliwert <154356044+aliwert@users.noreply.github.com> Date: Tue, 5 Nov 2024 00:16:07 +0300 Subject: [PATCH 2/3] feat: Turkish language support has been added to README.md --- README.md | 48 ++++++----- docs/turkish.md | 208 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 235 insertions(+), 21 deletions(-) create mode 100644 docs/turkish.md diff --git a/README.md b/README.md index 94beb617..3ed310df 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,8 @@ - # 🕷️ ScrapeGraphAI: You Only Scrape Once + [English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中文](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md) | [日本語](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/japanese.md) | [한국어](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/korean.md) -| [Русский](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/russian.md) - +| [Русский](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/russian.md) | [Türkçe](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/turkish.md) [![Downloads](https://img.shields.io/pepy/dt/scrapegraphai?style=for-the-badge)](https://pepy.tech/project/scrapegraphai) [![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen?style=for-the-badge)](https://github.com/pylint-dev/pylint) 
@@ -12,7 +11,7 @@ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT) [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX) -ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). +ScrapeGraphAI is a _web scraping_ python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you! @@ -39,9 +38,11 @@ Additional dependecies can be added while installing the library: - More Language Models: additional language models are installed, such as Fireworks, Groq, Anthropic, Hugging Face, and Nvidia AI Endpoints. This group allows you to use additional language models like Fireworks, Groq, Anthropic, Together AI, Hugging Face, and Nvidia AI Endpoints. + ```bash pip install scrapegraphai[other-language-models] ``` + - Semantic Options: this group includes tools for advanced semantic processing, such as Graphviz. ```bash @@ -56,13 +57,12 @@ Additional dependecies can be added while installing the library: - ## 💻 Usage + There are multiple standard scraping pipelines that can be used to extract information from a website (or local file). The most common one is the `SmartScraperGraph`, which extracts information from a single page given a user prompt and a source URL. - ```python import json from scrapegraphai.graphs import SmartScraperGraph @@ -98,16 +98,17 @@ The output will be a dictionary like the following: "contact_email": "contact@scrapegraphai.com" } ``` + There are other pipelines that can be used to extract information from multiple pages, generate Python scripts, or even generate audio files. 
-| Pipeline Name | Description | -|-------------------------|------------------------------------------------------------------------------------------------------------------| -| SmartScraperGraph | Single-page scraper that only needs a user prompt and an input source. | -| SearchGraph | Multi-page scraper that extracts information from the top n search results of a search engine. | -| SpeechGraph | Single-page scraper that extracts information from a website and generates an audio file. | -| ScriptCreatorGraph | Single-page scraper that extracts information from a website and generates a Python script. | -| SmartScraperMultiGraph | Multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources. | -| ScriptCreatorMultiGraph | Multi-page scraper that generates a Python script for extracting information from multiple pages and sources. | +| Pipeline Name | Description | +| ----------------------- | ------------------------------------------------------------------------------------------------------------- | +| SmartScraperGraph | Single-page scraper that only needs a user prompt and an input source. | +| SearchGraph | Multi-page scraper that extracts information from the top n search results of a search engine. | +| SpeechGraph | Single-page scraper that extracts information from a website and generates an audio file. | +| ScriptCreatorGraph | Single-page scraper that extracts information from a website and generates a Python script. | +| SmartScraperMultiGraph | Multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources. | +| ScriptCreatorMultiGraph | Multi-page scraper that generates a Python script for extracting information from multiple pages and sources. | For each of these graphs there is the multi version. It allows to make calls of the LLM in parallel. 
@@ -116,6 +117,7 @@ It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command, if you want to use local models. ## 🔍 Demo + Official streamlit demo: [![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app) @@ -131,6 +133,7 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/). ## 🏆 Sponsors +
Browserbase @@ -156,15 +159,18 @@ Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegra [![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/) [![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai) -## 📈 Telemetry -We collect anonymous usage metrics to enhance our package's quality and user experience. The data helps us prioritize improvements and ensure compatibility. If you wish to opt-out, set the environment variable SCRAPEGRAPHAI_TELEMETRY_ENABLED=false. For more information, please refer to the documentation [here](https://scrapegraph-ai.readthedocs.io/en/latest/scrapers/telemetry.html). +## 📈 Telemetry +We collect anonymous usage metrics to enhance our package's quality and user experience. The data helps us prioritize improvements and ensure compatibility. If you wish to opt-out, set the environment variable SCRAPEGRAPHAI_TELEMETRY_ENABLED=false. For more information, please refer to the documentation [here](https://scrapegraph-ai.readthedocs.io/en/latest/scrapers/telemetry.html). ## ❤️ Contributors + [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors) ## 🎓 Citations + If you have used our library for research purposes please quote us with the following reference: + ```text @misc{scrapegraph-ai, author = {Marco Perini, Lorenzo Padoan, Marco Vinciguerra}, @@ -181,11 +187,11 @@ If you have used our library for research purposes please quote us with the foll Authors_logos

-| | Contact Info | -|--------------------|----------------------| -| Marco Vinciguerra | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) | -| Marco Perini | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/perinim/) | -| Lorenzo Padoan | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) | +| | Contact Info | +| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Marco Vinciguerra | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) | +| Marco Perini | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/perinim/) | +| Lorenzo Padoan | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) | ## 📜 License diff --git a/docs/turkish.md b/docs/turkish.md new file mode 100644 index 00000000..6ca3df46 --- /dev/null +++ b/docs/turkish.md @@ -0,0 +1,208 @@ +# 🕷️ ScrapeGraphAI: You Only Scrape Once + +[English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中文](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md) | [日本語](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/japanese.md) +| [한국어](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/korean.md) +| [Русский](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/russian.md) | 
[Turkish](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/turkish.md) + +[![Downloads](https://img.shields.io/pepy/dt/scrapegraphai?style=for-the-badge)](https://pepy.tech/project/scrapegraphai) +[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen?style=for-the-badge)](https://github.com/pylint-dev/pylint) +[![Pylint](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/pylint.yml?label=Pylint&logo=github&style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml) +[![CodeQL](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/codeql.yml?label=CodeQL&logo=github&style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml) +[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT) +[![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX) + +ScrapeGraphAI, web siteleri ve yerel belgeler (XML, HTML, JSON, Markdown vb.) için kazıma hatları oluşturmak üzere LLM ve doğrudan grafik mantığını kullanan bir web scraping Python kütüphanesidir. + +Sadece çıkarmak istediğiniz bilgiyi belirtin; kütüphane bunu sizin için gerçekleştirecektir! + +

+ ScrapeGraphAI Hero +

+ +## 🚀 Hızlı kurulum + +ScrapeGraphAI için referans sayfası, PyPI'nin resmi sayfasında mevcuttur: [pypi](https://pypi.org/project/scrapegraphai/). + +```bash +pip install scrapegraphai + +playwright install +``` + +**NOT**: Diğer kütüphanelerle çakışmaları önlemek için kütüphaneyi bir sanal ortamda kurmanız önerilir. + +
+İsteğe Bağlı Bağımlılıklar + +Kütüphane kurulumunda ek bağımlılıklar eklenebilir: + +- Daha Fazla Dil Modeli: Fireworks, Groq, Anthropic, Hugging Face ve Nvidia AI Endpoints gibi ek dil modelleri yüklenir. + +Bu grup, Fireworks, Groq, Anthropic, Together AI, Hugging Face ve Nvidia AI Endpoints gibi ek dil modellerini kullanmanıza olanak tanır. + +```bash + pip install scrapegraphai[other-language-models] +``` + +- Anlamsal Seçenekler: Bu grup, Graphviz gibi ileri düzey anlamsal işleme araçlarını içerir. + +```bash + pip install scrapegraphai[more-semantic-options] +``` + +- Tarayıcı Seçenekleri: Bu grup, Browserbase gibi ek tarayıcı yönetim araçlarını/hizmetlerini içerir. + +```bash + pip install scrapegraphai[more-browser-options] +``` + +
+ +## 💻 Kullanım + +Bir web sitesinden (veya yerel dosyadan) bilgi almak için kullanılabilecek birçok standart kazıma hattı vardır. + +En yaygın olanı, bir kullanıcı istemi ve bir kaynak URL'si verildiğinde tek bir sayfadan bilgi çıkaran `SmartScraperGraph`'tır. + +```python +import json +from scrapegraphai.graphs import SmartScraperGraph + +# Kazıma hattı için yapılandırmayı tanımlayın + +graph_config = { +"llm": { +"api_key": "YOUR_OPENAI_APIKEY", +"model": "openai/gpt-4o-mini", +}, +"verbose": True, +"headless": False, +} + +# SmartScraperGraph örneğini oluşturun + +smart_scraper_graph = SmartScraperGraph( +prompt="Şirketin ne yaptığı, adı ve iletişim e-postası hakkında bazı bilgiler bulun.", +source="https://scrapegraphai.com/", +config=graph_config +) + +# Hattı çalıştırın + +result = smart_scraper_graph.run() +print(json.dumps(result, indent=4)) + +``` + +Çıktı, aşağıdaki gibi bir sözlük olacaktır: + +```python +{ + "company": "ScrapeGraphAI", + "name": "ScrapeGraphAI Extracting content from websites and local documents using LLM", + "contact_email": "contact@scrapegraphai.com" +} +``` + +Birden fazla sayfadan bilgi ayıklamak, Python komut dosyaları oluşturmak ve hatta ses dosyaları oluşturmak için kullanılabilecek başka işlem hatları da vardır. + +| Pipeline Name | Description | +| ----------------------- | ------------------------------------------------------------------------------------------------------------- | +| SmartScraperGraph | Single-page scraper that only needs a user prompt and an input source. | +| SearchGraph | Multi-page scraper that extracts information from the top n search results of a search engine. | +| SpeechGraph | Single-page scraper that extracts information from a website and generates an audio file. | +| ScriptCreatorGraph | Single-page scraper that extracts information from a website and generates a Python script. 
| +| SmartScraperMultiGraph | Multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources. | +| ScriptCreatorMultiGraph | Multi-page scraper that generates a Python script for extracting information from multiple pages and sources. | + +Bu grafiklerin her biri için çoklu versiyonu vardır. Bu, LLM'yi paralel olarak çağırmayı sağlar. + +Farklı LLM'leri API'ler aracılığıyla kullanmak mümkündür, örneğin **OpenAI**, **Groq**, **Azure** ve **Gemini**, veya **Ollama** kullanarak yerel modeller. + +Yerel modelleri kullanmak istiyorsanız, [Ollama](https://ollama.com/) kurulu olduğundan emin olun ve modelleri indirmek için **ollama pull** komutunu kullanın. + +## 🔍 Demo + +Resmi Streamlit demosu: + +[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app) + +Bunu doğrudan web üzerinde Google Colab kullanarak deneyin: + +[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing) + +## 📖 Dokümantasyon + +ScrapeGraphAI için dokümantasyonu [buradan](https://scrapegraph-ai.readthedocs.io/en/latest/) bulabilirsiniz. + +Docusaurus'u da [buradan](https://scrapegraph-doc.onrender.com/) kontrol edin. + +## 🏆 Sponsorlar + +
+ + Browserbase + + + SerpAPI + + + Stats + + + Stats + +
+ +## 🤝 Katkıda Bulunma + +Katkıda bulunmaktan çekinmeyin ve iyileştirmeleri tartışmak ve önerilerinizi iletmek için Discord sunucumuza katılın! + +Lütfen [katkı sağlama yönergelerini](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md) inceleyin. + +[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa) +[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/) +[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai) + +## 📈 Telemetri + +Paketin kalitesini ve kullanıcı deneyimini geliştirmek için anonim kullanım istatistikleri topluyoruz. Bu veriler, iyileştirmeleri önceliklendirmemize ve uyumluluğu sağlamamıza yardımcı olur. Eğer bu verileri almak istemiyorsanız, ortam değişkenini SCRAPEGRAPHAI_TELEMETRY_ENABLED=false olarak ayarlayın. Daha fazla bilgi için lütfen dokümantasyona [buradan](https://scrapegraph-ai.readthedocs.io/en/latest/scrapers/telemetry.html) bakın. + +## ❤️ Katkıda Bulunanlar + +[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors) + +## 🎓 Atıflar + +Eğer kütüphanemizi araştırma amaçlı kullandıysanız, lütfen aşağıdaki referansla atıfta bulunun: + +```text + @misc{scrapegraph-ai, + author = {Marco Perini, Lorenzo Padoan, Marco Vinciguerra}, + title = {Scrapegraph-ai}, + year = {2024}, + url = {https://github.com/VinciGit00/Scrapegraph-ai}, + note = {A Python library for scraping leveraging large language models} + } +``` + +## Yazarlar + +

+ Authors_logos +

+ +| | İletişim Bilgisi | +| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Marco Vinciguerra | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) | +| Marco Perini | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/perinim/) | +| Lorenzo Padoan | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) | + +## 📜 Lisans + +ScrapeGraphAI, MIT Lisansı altında lisanslanmıştır. Daha fazla bilgi için [LİSANS](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/LICENSE) dosyasına bakın. + +## Teşekkürler + +- Projeye katkıda bulunan tüm katkı sahiplerine ve açık kaynak topluluğuna destekleri için teşekkür etmek isteriz. +- ScrapeGraphAI, yalnızca veri keşfi ve araştırma amaçları için kullanılmak üzere tasarlanmıştır. Kütüphanenin herhangi bir kötüye kullanımından sorumlu değiliz. 
From ffe8cd83073551f6191ffc923d50bac993ff6f73 Mon Sep 17 00:00:00 2001 From: aliwert <154356044+aliwert@users.noreply.github.com> Date: Tue, 5 Nov 2024 00:18:44 +0300 Subject: [PATCH 3/3] up --- README.md | 46 ++++++++++++++++++++-------------------------- 1 file changed, 20 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index 3ed310df..d881cd41 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,10 @@ -# 🕷️ ScrapeGraphAI: You Only Scrape Once +# 🕷️ ScrapeGraphAI: You Only Scrape Once [English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中文](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md) | [日本語](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/japanese.md) | [한국어](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/korean.md) | [Русский](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/russian.md) | [Türkçe](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/turkish.md) + [![Downloads](https://img.shields.io/pepy/dt/scrapegraphai?style=for-the-badge)](https://pepy.tech/project/scrapegraphai) [![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen?style=for-the-badge)](https://github.com/pylint-dev/pylint) [![Pylint](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/pylint.yml?label=Pylint&logo=github&style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml) @@ -11,7 +12,7 @@ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT) [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX) -ScrapeGraphAI is a _web scraping_ python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). 
+ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you! @@ -38,11 +39,9 @@ Additional dependecies can be added while installing the library: - More Language Models: additional language models are installed, such as Fireworks, Groq, Anthropic, Hugging Face, and Nvidia AI Endpoints. This group allows you to use additional language models like Fireworks, Groq, Anthropic, Together AI, Hugging Face, and Nvidia AI Endpoints. - ```bash pip install scrapegraphai[other-language-models] ``` - - Semantic Options: this group includes tools for advanced semantic processing, such as Graphviz. ```bash @@ -57,12 +56,13 @@ Additional dependecies can be added while installing the library: -## 💻 Usage +## 💻 Usage There are multiple standard scraping pipelines that can be used to extract information from a website (or local file). The most common one is the `SmartScraperGraph`, which extracts information from a single page given a user prompt and a source URL. + ```python import json from scrapegraphai.graphs import SmartScraperGraph @@ -98,17 +98,16 @@ The output will be a dictionary like the following: "contact_email": "contact@scrapegraphai.com" } ``` - There are other pipelines that can be used to extract information from multiple pages, generate Python scripts, or even generate audio files. -| Pipeline Name | Description | -| ----------------------- | ------------------------------------------------------------------------------------------------------------- | -| SmartScraperGraph | Single-page scraper that only needs a user prompt and an input source. | -| SearchGraph | Multi-page scraper that extracts information from the top n search results of a search engine. | -| SpeechGraph | Single-page scraper that extracts information from a website and generates an audio file. 
| -| ScriptCreatorGraph | Single-page scraper that extracts information from a website and generates a Python script. | -| SmartScraperMultiGraph | Multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources. | -| ScriptCreatorMultiGraph | Multi-page scraper that generates a Python script for extracting information from multiple pages and sources. | +| Pipeline Name | Description | +|-------------------------|------------------------------------------------------------------------------------------------------------------| +| SmartScraperGraph | Single-page scraper that only needs a user prompt and an input source. | +| SearchGraph | Multi-page scraper that extracts information from the top n search results of a search engine. | +| SpeechGraph | Single-page scraper that extracts information from a website and generates an audio file. | +| ScriptCreatorGraph | Single-page scraper that extracts information from a website and generates a Python script. | +| SmartScraperMultiGraph | Multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources. | +| ScriptCreatorMultiGraph | Multi-page scraper that generates a Python script for extracting information from multiple pages and sources. | For each of these graphs there is the multi version. It allows to make calls of the LLM in parallel. @@ -117,7 +116,6 @@ It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command, if you want to use local models. ## 🔍 Demo - Official streamlit demo: [![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app) @@ -133,7 +131,6 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/). ## 🏆 Sponsors -
Browserbase @@ -159,18 +156,15 @@ Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegra [![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/) [![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai) -## 📈 Telemetry - +## 📈 Telemetry We collect anonymous usage metrics to enhance our package's quality and user experience. The data helps us prioritize improvements and ensure compatibility. If you wish to opt-out, set the environment variable SCRAPEGRAPHAI_TELEMETRY_ENABLED=false. For more information, please refer to the documentation [here](https://scrapegraph-ai.readthedocs.io/en/latest/scrapers/telemetry.html). -## ❤️ Contributors +## ❤️ Contributors [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors) ## 🎓 Citations - If you have used our library for research purposes please quote us with the following reference: - ```text @misc{scrapegraph-ai, author = {Marco Perini, Lorenzo Padoan, Marco Vinciguerra}, @@ -187,11 +181,11 @@ If you have used our library for research purposes please quote us with the foll Authors_logos

-| | Contact Info | -| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Marco Vinciguerra | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) | -| Marco Perini | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/perinim/) | -| Lorenzo Padoan | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) | +| | Contact Info | +|--------------------|----------------------| +| Marco Vinciguerra | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) | +| Marco Perini | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/perinim/) | +| Lorenzo Padoan | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) | ## 📜 License
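The Telemetry paragraph carried through these README hunks says that setting the environment variable `SCRAPEGRAPHAI_TELEMETRY_ENABLED=false` opts out of anonymous usage metrics. A minimal sketch of doing that from Python rather than from the shell follows. It assumes the variable has to be set before `scrapegraphai` is imported, which the patch does not state. The helper `telemetry_opted_out` is hypothetical and only checks the process environment; it is not part of the library.

```python
import os

# Variable name and value are taken from the README's Telemetry section.
# Setting it before importing scrapegraphai is an assumption, not something
# the patch confirms.
os.environ["SCRAPEGRAPHAI_TELEMETRY_ENABLED"] = "false"


def telemetry_opted_out() -> bool:
    """Hypothetical helper: True when the README's opt-out value is set."""
    value = os.environ.get("SCRAPEGRAPHAI_TELEMETRY_ENABLED", "")
    return value.strip().lower() == "false"


if __name__ == "__main__":
    print(telemetry_opted_out())
```

Exporting the variable in the shell before launching the script should have the same effect, since child processes inherit the environment.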