Welcome to the GDELT Text Extraction API project, a toolkit designed to streamline the extraction, processing, and annotation of structured JSON data from GDELT (Global Database of Events, Language, and Tone). This interface supports efficient collection and analysis of news content, with both manual and automated workflows.
- Flexible Search Interface: Customize queries by tone, language, keywords, repetition thresholds, and more.
- Search Persistence: Save search configurations and retrieved URLs for future reuse.
- Structured Output: Extract article content as well-formatted JSON, ready for downstream NLP tasks.
- Manual Annotation Tool: Web-based interface to annotate disinformation/misinformation signals in JSON format.
- Automated Processing: Easily execute scripted routines for targeted domain-level data collection.
- Auto-Annotation Support: Annotate extracted content using local LLMs in batch mode.
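Clone the repository, create and activate a virtual environment, and install the dependencies:
git clone https://github.com/tiziano777/GDELT_scraping
cd GDELT_scraping
python -m venv gdelt_env
source gdelt_env/bin/activate  # on Windows: gdelt_env\Scripts\activate
pip install -r requirements.txt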
Navigate to the GDELT_scraping root directory and create the following subfolders:
mkdir src/search_log
mkdir src/search_results
mkdir src/raw_text_data
mkdir src/annotated_data
mkdir src/EDA/topic
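The same folders can also be created from Python, which behaves identically across operating systems and is safe to rerun. This is a minimal sketch, not part of the repository: the helper name `ensure_dirs` is hypothetical, and only the folder names come from the commands above.

```python
from pathlib import Path

# Subfolders listed in the setup step, relative to the repository root.
REQUIRED_DIRS = [
    "src/search_log",
    "src/search_results",
    "src/raw_text_data",
    "src/annotated_data",
    "src/EDA/topic",
]


def ensure_dirs(root="."):
    """Create every required subfolder under root and return the paths.

    Hypothetical helper, not shipped with the project.
    """
    created = []
    for rel in REQUIRED_DIRS:
        path = Path(root) / rel
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_dirs()` from the repository root creates any folder that is missing and leaves existing ones untouched.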
To launch the interactive dashboard, run:
streamlit run DataGatheringDashboard.py
For automatic data scraping, customize the filters in AutoScraper.py and execute:
python AutoScraper.py
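To give a sense of what filters such as keyword, domain, language, and tone translate to, the sketch below composes a request URL for the public GDELT DOC 2.0 API using only the standard library. It is an illustration under assumptions: the function `build_query_url` and its parameters are hypothetical, and AutoScraper.py may define and apply its filters differently.

```python
from urllib.parse import urlencode

# Public endpoint of the GDELT DOC 2.0 API.
GDELT_DOC_ENDPOINT = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_query_url(keyword, domain=None, language=None, max_tone=None,
                    timespan="7d", max_records=50):
    """Compose an article-list request URL (hypothetical helper)."""
    # Multi-word keywords are quoted so they are matched as a phrase.
    terms = [f'"{keyword}"' if " " in keyword else keyword]
    if domain:
        terms.append(f"domain:{domain}")
    if language:
        terms.append(f"sourcelang:{language}")
    if max_tone is not None:
        # Keep only articles whose tone score is below the threshold.
        terms.append(f"tone<{max_tone}")
    params = {
        "query": " ".join(terms),
        "mode": "artlist",
        "format": "json",
        "timespan": timespan,
        "maxrecords": max_records,
    }
    return f"{GDELT_DOC_ENDPOINT}?{urlencode(params)}"
```

The function only builds the URL and performs no network request, so it can be used to inspect a query before any scraping run.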
The EDA component supports:
- Aggregation of collected article metadata
- Clustering by user-defined topical categories
- Exploratory analyses via word frequency statistics (see the wordFrequence folder)
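As a rough illustration of the word frequency analysis, the sketch below counts tokens across a list of article texts. It assumes the texts are already loaded as plain strings; the function name, the tokenization rule, and the tiny English stopword list are illustrative choices, not the code in the wordFrequence folder.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller,
# language-specific one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def word_frequencies(texts, top_n=10):
    """Return the top_n most common non-stopword tokens across texts."""
    counts = Counter()
    for text in texts:
        # Tokens are runs of Unicode letters, lowercased.
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)
```

The result is a list of `(word, count)` pairs sorted by descending count, which can be printed or plotted directly.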
To aggregate data from the raw_text_data directory:
cd src/EDA
python aggregate_results.py
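Conceptually, aggregation means merging many per-article JSON files into one collection. The sketch below shows one way to do that. It is not the logic of aggregate_results.py: the function name, the assumption that each file holds either one JSON object or a list of objects, and the added `source_file` field are all hypothetical.

```python
import json
from pathlib import Path


def aggregate_json(input_dir, output_file):
    """Merge all *.json files in input_dir into one JSON list.

    Returns the number of records written. Hypothetical helper.
    """
    records = []
    for path in sorted(Path(input_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Accept either one article per file or a list of articles per file.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                # Remember which file each record came from.
                records.append({**item, "source_file": path.name})
    with open(output_file, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return len(records)
```

Writing the output outside the input folder keeps a rerun from picking up the merged file as if it were a new article.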
To assign articles to topics:
- Edit the create_keyword_sets() function in similarity_layer.py, using the provided JSON schema.
- Then run:
python similarity_layer.py
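The idea of keyword-driven topic assignment can be sketched with plain keyword overlap: each topic owns a set of keywords, and an article goes to the topic whose keywords it mentions most. This is a simplified stand-in, and similarity_layer.py may compute similarity differently. The topic names, keyword sets, and both function names below are hypothetical and do not reflect the project's JSON schema.

```python
import re


def example_keyword_sets():
    """Hypothetical topic-to-keywords mapping, for illustration only."""
    return {
        "health": {"vaccine", "virus", "hospital"},
        "climate": {"climate", "emissions", "warming"},
    }


def assign_topic(text, keyword_sets, min_hits=1):
    """Return the topic with the most keyword hits, or None if too few."""
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    # Score each topic by how many of its keywords occur in the text.
    scores = {topic: len(tokens & kws) for topic, kws in keyword_sets.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

Raising `min_hits` makes the assignment stricter, so that a single incidental keyword match does not place an article in a topic.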
This project incorporates and builds upon components from the following open-source repository: GDELT Doc API.