WebScrape with assistance of AI Models (Trafilatura & NeuralQA)

franciscomvargas/descraper

About DeScraper

UI Show Off

Description

The purpose of this project is to scrape the web: given a webpage URL, it converts the source page (HTML) into text with Trafilatura and retrieves answers from the page with NeuralQA (question answering on large datasets).

Additionally, HTML tables found in the page can be converted into Excel and/or CSV format with pandas.read_html.
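To make the table conversion concrete, here is a minimal, dependency-free sketch of turning HTML tables into CSV text. It is not DeScraper's code: the project relies on pandas.read_html, which returns a list of DataFrames that can then be written with `to_csv` or `to_excel`. The standard-library parser below is a stand-in for illustration only; the names `TableParser` and `tables_to_csv` are hypothetical, and nested tables or `rowspan`/`colspan` are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> in an HTML document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables: each is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def tables_to_csv(html):
    """Return one CSV string per table found in `html`."""
    parser = TableParser()
    parser.feed(html)
    results = []
    for table in parser.tables:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(table)
        results.append(buffer.getvalue())
    return results


sample = (
    "<table><tr><th>Season</th><th>Year</th></tr>"
    "<tr><td>1</td><td>1989</td></tr></table>"
)
print(tables_to_csv(sample)[0])
```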

How it works

First Steps:

Make your first request:

By POST Request

You can use any programming language to make this request; Python is used here to illustrate how:

Payload Explanation

import requests

descraper_url = "http://127.0.0.1:8880/api/scraper"

payload = {
    "url": "https://en.wikipedia.org/wiki/The_Simpsons",
    "html_text": True,
    "query": ["When the simpsons debut?"],
    "qa_port": 8888,
    "expansionterms": [],
    "excel": True,
    "csv": True,
    "overwrite_files": False
}

response = requests.request("POST", descraper_url, json=payload)

print(response.json())

By User Interface

  • Click here to search for Descraper!
  • Fill in the payload parameters:

Documentation

UI Payload Explanation

Payload Explanation

| Parameter | Type | Optional | Description |
| --- | --- | --- | --- |
| url | string | | The link of the website to scrape |
| query | array of strings | | The questions NeuralQA should answer; required when running NeuralQA |
| html_text | boolean | | Run Trafilatura to get the text of the webpage |
| qa_port | integer | | NeuralQA is a TCP/IP service running in parallel; its port can be specified here. Default is 8888 |
| expansionterms | array of strings for each query | | NeuralQA can expand queries to improve the results by adding expansion terms (keywords) to the NeuralQA request. To get the expansion terms, make a preliminary POST request to "http://127.0.0.1:8880/api/expand" with the simple payload {query: [array of queries]}. For the full picture of this functionality, see NeuralQA Query Expansion |
| excel | boolean | | Generate an Excel file with the webpage tables |
| csv | boolean | | Generate CSV files with the webpage tables |
| overwrite_files | boolean | | DeScraper stores the scraped HTML pages and the generated tables locally. When you re-request the same URL, switch this parameter on to overwrite the stored files (for example, if the webpage has been updated) |
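
As a rough illustration of the parameters above, the sketch below assembles a request body with the documented parameter names and builds (without sending) the preliminary request to the expand endpoint. The helper names `build_payload` and `build_expand_request` are hypothetical, and every default except `qa_port` (8888) is an assumption, since the other defaults are not stated here. The response format of `/api/expand` is not documented in this README, so it is not parsed.

```python
import json
import urllib.request

DESCRAPER_URL = "http://127.0.0.1:8880/api/scraper"
EXPAND_URL = "http://127.0.0.1:8880/api/expand"


def build_payload(url, query=None, html_text=False, qa_port=8888,
                  expansionterms=None, excel=False, csv=False,
                  overwrite_files=False):
    """Assemble a DeScraper request body (hypothetical helper).

    Only the qa_port default (8888) comes from the documentation;
    the other defaults are assumptions made for this sketch.
    """
    if not isinstance(url, str) or not url:
        raise ValueError("url must be a non-empty string")
    if query is None:
        query = []
    if isinstance(query, str) or not all(isinstance(q, str) for q in query):
        raise ValueError("query must be an array of strings")
    return {
        "url": url,
        "html_text": bool(html_text),
        "query": list(query),
        "qa_port": int(qa_port),
        "expansionterms": list(expansionterms or []),
        "excel": bool(excel),
        "csv": bool(csv),
        "overwrite_files": bool(overwrite_files),
    }


def build_expand_request(queries):
    """Build (but do not send) the preliminary POST to the expand endpoint."""
    body = json.dumps({"query": list(queries)}).encode("utf-8")
    return urllib.request.Request(
        EXPAND_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_payload(
    "https://en.wikipedia.org/wiki/The_Simpsons",
    query=["When the simpsons debut?"],
    html_text=True,
)
print(json.dumps(payload, indent=2))
```

With the service running, `payload` can be passed as the `json` argument of `requests.post(DESCRAPER_URL, json=payload)`, and the expand request can be sent with `urllib.request.urlopen`.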

NeuralQA Query Expansion

  • Explanation:

    • First, a set of rules are used to determine which token in the query to expand. These rules are chosen to improve recall (surface relevant queries) without altering the semantics of the original query. Example rules include only expanding ADJECTIVES, ADVERBS and NOUNS ; other parts of speech are not expandable. Once expansion candidates are selected, they are then iteratively masked and a masked language model is used to predict tokens that best complete the sentence given the surrounding tokens.
  • Try it out:

    1. When Query is filled with an array of strings, press the "Expand Queries" button;

    2. Select the candidates that best fit your queries:

    UI Expand Queries

    3. Finally, when you press "Initiate DeScraper", the selected candidates will be added to the POST request as expansionterms:

    Request With Expansion Terms
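
The rule-based selection and iterative masking described in the explanation at the start of this section can be sketched as follows. This is a simplified illustration, not NeuralQA's implementation: the part-of-speech tags are assigned by hand, the tag names (`ADJ`, `ADV`, `NOUN`) and the `[MASK]` placeholder are assumptions, and the masked language model step is left out. The hypothetical function `masked_variants` only produces the sentences a model would be asked to complete.

```python
EXPANDABLE_TAGS = {"ADJ", "ADV", "NOUN"}


def masked_variants(tagged_tokens, mask="[MASK]"):
    """Return (original token, masked sentence) pairs for expandable tokens."""
    variants = []
    for index, (token, tag) in enumerate(tagged_tokens):
        if tag not in EXPANDABLE_TAGS:
            continue  # other parts of speech are not expanded
        words = [mask if i == index else word
                 for i, (word, _) in enumerate(tagged_tokens)]
        variants.append((token, " ".join(words)))
    return variants


# Hand-tagged example query; a real pipeline would use a POS tagger.
query = [("what", "PRON"), ("is", "AUX"), ("the", "DET"),
         ("longest", "ADJ"), ("animated", "ADJ"), ("series", "NOUN")]

for token, sentence in masked_variants(query):
    print(f"{token!r}: {sentence}")
```

Each masked sentence would then be passed to a masked language model, and its top predictions for the masked position become the expansion candidates offered in the UI.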

Installation

Use DeSOTA official Manager & Tools

  1. Download Installer for your Platform

  2. Open Models Installation tab

  3. Select the Available Tool franciscomvargas/descraper

  4. Press Start Installation

Manual Windows Installation

  • Go to CMD (command prompt):
    • ⊞ Win + R
    • Enter: cmd
    • ↵ Enter

Download:

  1. Create Model Folder:
rmdir /S /Q %UserProfile%\Desota\Desota_Models\DeScraper
mkdir %UserProfile%\Desota\Desota_Models\DeScraper
  2. Download Latest Release:
powershell -command "Invoke-WebRequest -Uri https://github.com/franciscomvargas/descraper/archive/refs/tags/v0.0.0.zip -OutFile %UserProfile%\DeScraper_release.zip"
  3. Uncompress Release:
tar -xzvf %UserProfile%\DeScraper_release.zip -C %UserProfile%\Desota\Desota_Models\DeScraper --strip-components 1
  4. Delete Compressed Release:
del %UserProfile%\DeScraper_release.zip

Setup:

  1. Run the setup script:
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.setup.bat
  • Optional Arguments:

    | arg | Description | Example |
    | --- | --- | --- |
    | /debug | Log everything (useful for debugging) | %UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.setup.bat /debug |
    | /manualstart | Don't start the service at the end of setup | %UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.setup.bat /manualstart |

Manual Linux Installation

  • Go to Terminal:
    • Ctrl + Alt + T

Download:

  1. Create Model Folder:
rm -rf ~/Desota/Desota_Models/DeScraper
mkdir -p ~/Desota/Desota_Models/DeScraper
  2. Download Latest Release:
wget https://github.com/franciscomvargas/descraper/archive/refs/tags/v0.0.0.zip -O ~/DeScraper_release.zip
  3. Uncompress Release:
sudo apt install libarchive-tools -y && bsdtar -xzvf ~/DeScraper_release.zip -C ~/Desota/Desota_Models/DeScraper --strip-components=1
  4. Delete Compressed Release:
rm -rf ~/DeScraper_release.zip
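
For readers who prefer a script, the download and uncompress steps above can be approximated in Python on either platform. This is a sketch under assumptions, not part of DeScraper: the function names `extract_stripped` and `download_release` are hypothetical, the archive is kept in memory instead of being saved to disk, and an existing model folder is not deleted first. Dropping the first path component mirrors the `--strip-components` option used above.

```python
import io
import urllib.request
import zipfile
from pathlib import Path, PurePosixPath

RELEASE_URL = (
    "https://github.com/franciscomvargas/descraper/"
    "archive/refs/tags/v0.0.0.zip"
)


def extract_stripped(zip_bytes, target_dir):
    """Extract a zip archive into target_dir, dropping the top-level folder."""
    target = Path(target_dir)
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Drop the first path component, like --strip-components 1.
            parts = PurePosixPath(info.filename).parts[1:]
            if not parts or info.is_dir():
                continue
            if ".." in parts:
                continue  # skip entries that would escape target_dir
            destination = target.joinpath(*parts)
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_bytes(archive.read(info))
            written.append(destination)
    return written


def download_release(url=RELEASE_URL, target_dir=None):
    """Download the release archive and unpack it into the model folder."""
    if target_dir is None:
        target_dir = Path.home() / "Desota" / "Desota_Models" / "DeScraper"
    with urllib.request.urlopen(url) as response:
        return extract_stripped(response.read(), target_dir)
```

Calling `download_release()` requires network access and returns the list of files written; the setup script still has to be run afterwards as described below.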

Setup:

  1. Run the setup script:
sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.setup.bash
  • Optional Arguments:

    | arg | Description | Example |
    | --- | --- | --- |
    | -d | Setup with debug echo ON | sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.setup.bash -d |
    | -m | Don't start the service at the end of setup | sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.setup.bash -m |

Service Operations

Windows

  • Go to CMD as Administrator (command prompt):
    • ⊞ Win + R
    • Enter: cmd
    • Ctrl + ⇧ Shift + ↵ Enter

Start Service

```cmd
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.start.bat
```

Stop Service

```cmd
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.stop.bat
```

Service Status

```cmd
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.status.bat
```

Linux

  • Go to Terminal:
    • Ctrl + Alt + T

Start Service

```bash
sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.start.bash
```

Stop Service

```bash
sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.stop.bash
```

Service Status

```bash
bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.status.bash
```
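
Independently of the status scripts (whose output is not shown here), a quick way to see whether the services answer is to try a TCP connection. The sketch below assumes the ports used earlier in this README: 8880 for the DeScraper API and 8888 as the default NeuralQA port. The function name `is_listening` is hypothetical, and an open port only shows that something is listening, not that it is DeScraper.

```python
import socket


def is_listening(host="127.0.0.1", port=8880, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("DeScraper API reachable:", is_listening(port=8880))
print("NeuralQA reachable:", is_listening(port=8888))
```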

Uninstallation

Use DeSOTA official Manager & Tools

  1. Open Models Dashboard tab

  2. Select the model franciscomvargas/descraper

  3. Press Uninstall

Manual Windows Uninstallation

  • Go to CMD as Administrator (command prompt):
    • ⊞ Win + R
    • Enter: cmd
    • Ctrl + ⇧ Shift + ↵ Enter
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.uninstall.bat
  • Optional Arguments

    | arg | Description | Example |
    | --- | --- | --- |
    | /Q | Uninstall without requiring user interaction | %UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.uninstall.bat /Q |

Manual Linux Uninstallation

  • Go to Terminal:
    • Ctrl + Alt + T
sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.uninstall.bash
  • Optional Arguments

    | arg | Description | Example |
    | --- | --- | --- |
    | -q | Uninstall without requiring user interaction | sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.uninstall.bash -q |

Credits / License

@inproceedings{
  barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}
@article{
  dibia2020neuralqa,
  title={NeuralQA: A Usable Library for Question Answering (Contextual Query Expansion + BERT) on Large Datasets},
  author={Victor Dibia},
  year={2020},
  journal={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations}
}
