The purpose of this project is to scrape the web: given a webpage URL, it converts the page source (HTML) into text with Trafilatura and retrieves answers from the page with NeuralQA (question answering on large datasets).
Additionally, HTML tables are converted into Excel and/or CSV format with pandas.read_html.
By POST Request
You can use any programming language to make this request. I will use Python to illustrate how you can do it:
```python
import requests

# Local DeScraper API endpoint
descraper_url = "http://127.0.0.1:8880/api/scraper"

payload = {
    "url": "https://en.wikipedia.org/wiki/The_Simpsons",
    "html_text": True,
    "query": ["When the simpsons debut?"],
    "qa_port": 8888,
    "expansionterms": [],
    "excel": True,
    "csv": True,
    "overwrite_files": False,
}

# Send the payload as the JSON body of the POST request
response = requests.request("POST", descraper_url, json=payload)
print(response.json())
```

| Parameter | Type | Optional | Description |
|---|---|---|---|
| url | string | ✗ | URL of the webpage to scrape |
| query | array of strings | ✓ | Required when running NeuralQA: the questions describing the data you want to retrieve |
| html_text | boolean | ✓ | Run Trafilatura to extract the text of the webpage |
| qa_port | integer | ✓ | NeuralQA runs in parallel as a TCP/IP service; this parameter sets its port. Default is 8888 |
| expansionterms | array of strings for each query | ✓ | NeuralQA can expand queries to improve the results by adding expansion terms (keywords) to the NeuralQA request. To get the expansion terms, make a preliminary POST request to "http://127.0.0.1:8880/api/expand" with the simple payload {query: [array of queries]}. For a full grasp of this functionality, see NeuralQA Query Expansion |
| excel | boolean | ✓ | Generate an Excel file with the webpage tables |
| csv | boolean | ✓ | Generate CSV files with the webpage tables |
| overwrite_files | boolean | ✓ | DeScraper stores the scraped HTML pages and the generated tables locally. When you re-request the same URL, switch this parameter on to overwrite those files (for example, if the webpage has been updated) |
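
The parameter table above can be turned into a small client-side check before sending a request. The following is a minimal sketch, not part of DeScraper: `build_payload` and `OPTIONAL_FIELDS` are hypothetical names, only the types listed in the table are checked, and the server may validate differently. Only the options you pass are included, since the table documents a default for `qa_port` alone.

```python
import json

# Optional parameter names and Python types, taken from the table above.
OPTIONAL_FIELDS = {
    "query": list,
    "html_text": bool,
    "qa_port": int,
    "expansionterms": list,
    "excel": bool,
    "csv": bool,
    "overwrite_files": bool,
}


def build_payload(url, **options):
    """Return a dict usable as the JSON body of the scraper request (hypothetical helper)."""
    if not isinstance(url, str) or not url:
        raise ValueError("'url' is mandatory and must be a non-empty string")
    payload = {"url": url}
    for name, value in options.items():
        expected = OPTIONAL_FIELDS.get(name)
        if expected is None:
            raise KeyError(f"unknown parameter: {name}")
        # bool is a subclass of int in Python, so reject True/False for integer fields.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            raise TypeError(f"'{name}' must be of type {expected.__name__}")
        if name == "query" and not all(isinstance(q, str) for q in value):
            raise TypeError("'query' must be an array of strings")
        payload[name] = value
    return payload


payload = build_payload(
    "https://en.wikipedia.org/wiki/The_Simpsons",
    html_text=True,
    query=["When the simpsons debut?"],
    excel=True,
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as `json=payload` in the request shown earlier.
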
- Explanation:
  - First, a set of rules is used to determine which tokens in the query to expand. These rules are chosen to improve recall (surface relevant queries) without altering the semantics of the original query. Example rules include only expanding ADJECTIVES, ADVERBS and NOUNS; other parts of speech are not expanded. Once expansion candidates are selected, they are iteratively masked, and a masked language model is used to predict the tokens that best complete the sentence given the surrounding tokens.
- Try it out:
  - When Query is filled with an array of strings, press the "Expand Queries" button.
  - Select the candidates that best fit your queries.
  - Finally, when you press "Initiate DeScraper", the selected candidates are added to the POST request as `expansionterms`.
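
The same two-step flow (expand the queries first, then scrape with the chosen terms) can also be driven from code. This is a rough sketch under stated assumptions: the shape of the `/api/expand` response is not documented here, so `choose_terms` is a hypothetical callback that you would write after inspecting a real response, and `post_json`, `expand_payload`, `scraper_payload` and `run` are illustrative names. It uses the standard library's `urllib` so that it has no third-party dependency.

```python
import json
import urllib.request

# Base of the two endpoints named in this README.
API_BASE = "http://127.0.0.1:8880/api"


def post_json(url, payload, timeout=60):
    """POST `payload` as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def expand_payload(queries):
    """Body of the preliminary request: {query: [array of queries]}."""
    return {"query": list(queries)}


def scraper_payload(url, queries, selected_terms):
    """Body of the scraper request, carrying the selected expansion terms."""
    return {"url": url, "query": list(queries), "expansionterms": list(selected_terms)}


def run(url, queries, choose_terms):
    """Expand the queries, let `choose_terms` pick candidates, then scrape.

    `choose_terms` is a hypothetical callback: it receives the raw expand
    response and returns the terms to send as `expansionterms`.
    """
    expansion = post_json(f"{API_BASE}/expand", expand_payload(queries))
    terms = choose_terms(expansion)
    return post_json(f"{API_BASE}/scraper", scraper_payload(url, queries, terms))
```

Nothing is sent until `run` is called; a first call with `choose_terms=lambda expansion: []` after printing the expand response is one way to discover its structure.
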
Use DeSOTA official Manager & Tools
- Open the `Models Instalation` tab
- Select the available tool `franciscomvargas/descraper`
- Press `Start Instalation`
- Go to CMD (command prompt):
  - ⊞ Win + R
  - Enter: `cmd`
  - ↵ Enter
- Create Model Folder:
```cmd
rmdir /S /Q %UserProfile%\Desota\Desota_Models\DeScraper
mkdir %UserProfile%\Desota\Desota_Models\DeScraper
```
- Download Last Release:
```cmd
powershell -command "Invoke-WebRequest -Uri https://github.com/franciscomvargas/descraper/archive/refs/tags/v0.0.0.zip -OutFile %UserProfile%\DeScraper_release.zip"
```
- Uncompress Release:
```cmd
tar -xzvf %UserProfile%\DeScraper_release.zip -C %UserProfile%\Desota\Desota_Models\DeScraper --strip-components 1
```
- Delete Compressed Release:
```cmd
del %UserProfile%\DeScraper_release.zip
```
- Setup:
```cmd
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.setup.bat
```
- Optional Arguments:

| arg | Description | Example |
|---|---|---|
| /debug | Log everything (useful for debug) | `%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.setup.bat /debug` |
| /manualstart | Don't start at end of setup | `%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.setup.bat /manualstart` |
- Go to Terminal:
  - Ctrl + Alt + T
- Create Model Folder:
```bash
rm -rf ~/Desota/Desota_Models/DeScraper
mkdir -p ~/Desota/Desota_Models/DeScraper
```
- Download Last Release:
```bash
wget https://github.com/franciscomvargas/descraper/archive/refs/tags/v0.0.0.zip -O ~/DeScraper_release.zip
```
- Uncompress Release:
```bash
sudo apt install libarchive-tools -y && bsdtar -xzvf ~/DeScraper_release.zip -C ~/Desota/Desota_Models/DeScraper --strip-components=1
```
- Delete Compressed Release:
```bash
rm -rf ~/DeScraper_release.zip
```
- Setup:
```bash
sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.setup.bash
```
- Optional Arguments:

| arg | Description | Example |
|---|---|---|
| -d | Setup with debug Echo ON | `sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.setup.bash -d` |
| -m | Don't start service at end of setup | `sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.setup.bash -m` |
- Go to CMD as Administrator (command prompt):
  - ⊞ Win + R
  - Enter: `cmd`
  - Ctrl + ⇧ Shift + ↵ Enter
```cmd
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.start.bat
```
```cmd
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.stop.bat
```
```cmd
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.status.bat
```
- Go to Terminal:
  - Ctrl + Alt + T
```bash
sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.start.bash
```
```bash
sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.stop.bash
```
```bash
bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.status.bash
```
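
Besides the status scripts, a quick way to confirm from Python that something is listening on the service ports is a plain TCP connection attempt. This is only a sketch: `is_port_open` is a hypothetical helper, port 8880 is taken from the example API URL in this README, and 8888 is the documented NeuralQA default. A successful connection shows that the port is open, not that the service behind it is healthy.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Ports assumed from this README: 8880 (DeScraper API), 8888 (NeuralQA default).
for name, port in (("DeScraper API", 8880), ("NeuralQA", 8888)):
    state = "reachable" if is_port_open("127.0.0.1", port) else "not reachable"
    print(f"{name} on port {port}: {state}")
```
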
Use DeSOTA official Manager & Tools
- Open the `Models Dashboard` tab
- Select the model `franciscomvargas/descraper`
- Press `Uninstall`
- Go to CMD as Administrator (command prompt):
  - ⊞ Win + R
  - Enter: `cmd`
  - Ctrl + ⇧ Shift + ↵ Enter
```cmd
%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.uninstall.bat
```
- Optional Arguments:

| arg | Description | Example |
|---|---|---|
| /Q | Uninstall without requiring user interaction | `%UserProfile%\Desota\Desota_Models\DeScraper\executables\Windows\descraper.uninstall.bat /Q` |
- Go to Terminal:
  - Ctrl + Alt + T
```bash
sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.uninstall.bash
```
- Optional Arguments:

| arg | Description | Example |
|---|---|---|
| -q | Uninstall without requiring user interaction | `sudo bash ~/Desota/Desota_Models/DeScraper/executables/Linux/descraper.uninstall.bash -q` |
Credits / License
```bibtex
@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = {Barbaresi, Adrien},
  booktitle = {Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations},
  pages = {122--131},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2021.acl-demo.15},
  year = {2021},
}

@article{dibia2020neuralqa,
  title = {NeuralQA: A Usable Library for Question Answering (Contextual Query Expansion + BERT) on Large Datasets},
  author = {Victor Dibia},
  year = {2020},
  journal = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations}
}
```