A deep learning–powered system for probabilistic fake news detection in Chinese-language news
DeFake-ZH is a deep learning–based system for Chinese fake news detection, integrating MacBERT, MGP Database Matching, and Contradiction Detection.
Trained on over 390k news articles (Sept. 2024–May 2025), the system achieves 99.12% accuracy and 0.9912 F1-score, outperforming previous studies.
Figure: Overall system architecture of DeFake-ZH
- News Crawler: collects verified and unverified articles from mainstream Taiwanese media and fact-checking organizations.
- Data Preparation: splits each article into Title and Content fields for subsequent analysis.
- Preprocessing: text cleaning, deduplication, Chinese tokenization, and embedding extraction.
- Detection Modules:
  - MacBERT + MLP Classifier: supervised classification using contextual embeddings.
  - Database Matching: compares the input against verified/fake entries in the MGP database (see the similarity sketch after this list).
  - Contradiction Detection: checks for logical contradictions via knowledge triplets and Natural Language Inference (NLI).
  - LLM-Scored Features: headlines scored by LLaMA 3.1-8B-Instruct for sentiment, subjectivity, and extreme expressions. These features enhance interpretability and help detect implicit disinformation.
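As an illustration of the database-matching idea, here is a minimal, hedged sketch using sentence embeddings and cosine similarity. It assumes the BAAI/bge-m3 embedding model listed under models/ and the ≈0.75 similarity threshold noted for src/mgpSearch.py; the reference entries and function name are hypothetical, and the real module additionally handles OpenCC conversion and indexed (FAISS) search.

```python
# Minimal sketch of similarity-based database matching (illustrative only).
# Assumes a sentence-embedding model such as BAAI/bge-m3 and a cosine-similarity
# threshold of ~0.75, as noted for src/mgpSearch.py; OpenCC conversion and
# FAISS indexing used by the real module are omitted here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")

# Hypothetical reference entries; the project stores these under data/db/.
reference_entries = [
    ("某地將發放一萬元現金給全體民眾", "fake"),
    ("氣象署發布颱風海上警報", "verified"),
]

def match_against_database(claim: str, threshold: float = 0.75):
    """Return reference entries whose similarity to the claim exceeds the threshold."""
    claim_emb = model.encode(claim, convert_to_tensor=True, normalize_embeddings=True)
    hits = []
    for text, label in reference_entries:
        ref_emb = model.encode(text, convert_to_tensor=True, normalize_embeddings=True)
        score = util.cos_sim(claim_emb, ref_emb).item()
        if score >= threshold:
            hits.append((text, label, round(score, 4)))
    return hits

print(match_against_database("政府宣布發放一萬元現金給所有民眾"))
```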
 
- GUI Interface: interactive system for quick and full analysis, providing probability scores with expandable interpretability reports.
- Sources: see the Dataset section for the full list of news outlets and fact-checking references.
- Articles: over 390k (verified + unverified).
- Additional scoring: LLaMA 3.1-8B-Instruct used to generate auxiliary labels on tone and sentiment.
- Extract title and content.
- Normalize and structure for downstream processing.
- Text cleaning.
- Duplicate removal.
- Chinese word segmentation.
- Embedding extraction via MacBERT.
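For illustration, here is a minimal sketch of this embedding step using Hugging Face transformers. Mean pooling over token embeddings is an assumption made for the example; the project's actual feature pipeline lives in src/featureEngineering.py.

```python
# Minimal sketch of MacBERT embedding extraction (illustrative; the project's
# feature pipeline is in src/featureEngineering.py). Mask-aware mean pooling
# over the last hidden states is an assumption for this example.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-large")
model = AutoModel.from_pretrained("hfl/chinese-macbert-large")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return a single fixed-size vector for the input text."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)          # [1, seq_len, 1]
    summed = (outputs.last_hidden_state * mask).sum(dim=1)  # [1, hidden]
    return (summed / mask.sum(dim=1)).squeeze(0)            # [hidden]

vector = embed("這是一則待檢測的新聞標題")
print(vector.shape)  # torch.Size([1024]) for chinese-macbert-large
```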
- MacBERT Embedding + MLP Classifier – supervised classification using contextual embeddings (a minimal sketch of such a classifier head follows this list).
- Database Matching – checks for matches against known fake or verified news; a high-similarity match provides a strong signal.
- Contradiction Detection – detects inconsistencies with known facts using triplet extraction and natural language inference (NLI).
- LLM-Scored Features – evaluates sentiment, subjectivity, and provocative tone, displayed in the GUI for interpretability.
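Below is a hedged sketch of such a classifier head in PyTorch. Layer sizes and the class-index convention are placeholders; the project's real classifier and metrics live in src/classifier.py and are trained via src/trainClassifier.py.

```python
# Illustrative MLP head over embedding features (layer sizes are placeholders;
# the project's own classifier and training code are in src/classifier.py and
# src/trainClassifier.py).
import torch
from torch import nn

class NewsMLP(nn.Module):
    def __init__(self, in_dim: int = 1024, hidden: int = 256, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = NewsMLP()
features = torch.randn(4, 1024)            # e.g., MacBERT embeddings for 4 articles
probs = torch.softmax(model(features), dim=-1)
print(probs[:, 1])                          # probability of the "fake" class (assumed index)
```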
Figure: Experimental results comparing two settings. (Left) MGP-only baseline. (Right) MGP+MacBERT.
| System | Precision | Recall | F1-score | Accuracy | 
|---|---|---|---|---|
| Only MGP | 0.8421 | 0.7908 | 0.7827 | 0.7908 | 
| MGP + MacBERT | 0.9913 | 0.9912 | 0.9912 | 0.9912 | 
- Only-MGP: high precision (97.45%) for fake news detection but low recall (59.72%), indicating a high miss rate; this imbalance lowers the overall accuracy to 79.08%.
- MGP + MacBERT: balanced across all metrics, consistently above 99%.
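For reference, metrics of this kind can be recomputed from predictions with scikit-learn; the macro averaging below is an assumption for the example, and the project's own evaluation utilities are in src/classifier.py and tests/testAll.py.

```python
# Illustrative computation of precision/recall/F1/accuracy from predictions
# (the project's own evaluation lives in src/classifier.py and tests/testAll.py;
# the macro averaging chosen here is an assumption for the example).
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy labels: 1 = fake, 0 = real
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"Precision={precision:.4f} Recall={recall:.4f} F1={f1:.4f} "
      f"Accuracy={accuracy_score(y_true, y_pred):.4f}")
print(confusion_matrix(y_true, y_pred))
```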
| | This Project | [1] | [2] |
|---|---|---|---|
| Language | Chinese | Chinese | Chinese |
| Method | MacBERT | Keras Sequential | BERT |
| Accuracy | 99.12% | 90.53% | - |
| F1-score | 0.9912 | - | 0.6701 |
The proposed system achieved an accuracy of 99.12% and an F1-score of 0.9912,
clearly outperforming prior studies.
In particular, [2] (Lin, 2021) applied a BERT-based model to classify CoFacts
data into binary categories, achieving a best-case Macro-F1 score of 0.6701.
- [1] Wang et al. (2021). Empirical Research on Fake News Detection Using AI Technology.
- [2] Lin (2021). Exploring Artificial Intelligence Technologies for Fake News Detection.
Users may collect comparable data from the listed sources and fact-checking organizations, or contact us for further information regarding data access.
Collected news data come from a variety of Taiwanese media outlets:
The dataset includes the following classification labels:
- entertain_sports – Entertainment & Sports
- international – International News
- local_society – Local & Society
- politics – Politics
- technology_life – Technology & Life
DeFake-ZH is also informed by fact-checking resources from trusted Taiwanese organizations:
data/
├── db/          # Reference DBs/indexes (.json/.pkl/.faiss)
├── features/    # Extracted feature tensors (.npy/.pt)
├── processed/   # Cleaned and split data (train/val/test)
└── raw/         # Raw CSV/JSON news data

models/
├── bge-m3/                  # Sentence embeddings (BAAI/bge-m3)
├── chinese-macbert-large/   # Chinese MacBERT (hfl/chinese-macbert-large)
├── ltp/                     # HIT-SCIR LTP models
├── task/                    # Task-specific classifier
├── text2vec/                # Chinese sentence embeddings
├── word2vec/                # Chinese word embeddings
└── README.md

src/
├── gui.py                # Gradio app entry point (Blocks UI & events)
├── interface.py          # Pipeline orchestrator (single-/multi-machine; helper clients)
├── otherGUI.py           # Shared GUI components & CSS hooks
├── scores.py             # Scoring & summary rendering
├── mgpSearch.py          # MGP database search (sentence similarity, OpenCC, threshold≈0.75)
├── contradiction.py      # Triplet extraction (LTP) + NLI contradiction checks (with caching)
├── classifier.py         # PyTorch MLP classifier + metrics
├── trainClassifier.py    # Training script
├── buildDatabase.py      # DB/index builder (merge, sentence split, triplets)
├── dataPreparation.py    # Cleaning, deduplication, tokenization, splits
├── featureEngineering.py # Embedding/feature pipelines (MacBERT/BGE/Text2Vec/Word2Vec)
├── PU_Learning.py        # PU learning: initial labeling with TF‑IDF + logistic regression
├── helper.py             # Helper server runner (contrad/MGP) for remote calls
├── const.py              # Global constants and canonical paths
└── nodes.py              # Embedding & vector store utils (FAISS, LangChain docs)

tests/
├── predict_single_news.py   # Single‑news prediction demo
├── testMLP.py               # Plot training/validation curves for MLP
└── testAll.py               # Analyze results: confusion matrices & PR/F1 across phases

git clone https://github.com/yachiashen/DeFake-ZH.git
cd DeFake-ZH
git lfs install
git lfs pull

This repository uses Git LFS to manage large files (e.g., models, databases).
Please make sure you have Git LFS installed:

# Install Git LFS (only once per system)
git lfs install

# After cloning the repository, fetch large files
git lfs pull
conda env update -f environment.yml --prune
conda activate defake-zh
pip install -r requirements.txt

- chinese-macbert-large: auto-download via Hugging Face transformers
- Others (bge-m3, ltp, word2vec, text2vec, task): download manually as described in models/README.md
Ensure the models/ directory and its subfolders exist before running.
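Before launching, it may help to confirm the expected model folders are present. The short check below is illustrative and not part of the repository; the folder names are taken from the models/ tree above.

```python
# Quick sanity check that the expected model folders exist before launching
# (folder names follow the models/ tree above; this script is illustrative,
# not part of the repository).
from pathlib import Path

expected = ["bge-m3", "chinese-macbert-large", "ltp", "task", "text2vec", "word2vec"]
models_dir = Path("models")

missing = [name for name in expected if not (models_dir / name).is_dir()]
if missing:
    print("Missing model folders:", ", ".join(missing))
else:
    print("All expected model folders are present.")
```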
cd src
python gui.py

After launching, Gradio will display a local URL in the terminal (e.g., http://127.0.0.1:7860/).
Open it in your browser, enter a News Title and News Content, then choose Quick Analysis or Full Analysis.
- The system outputs:
  - Database matches
  - Contradiction detection results
  - MacBERT classification with sentence-level scores
  - Final summary (expandable for detailed interpretability)
 
Or try it online (with limited functionality): DeFake-ZH on Hugging Face Spaces
This project was carried out as part of the Undergraduate Capstone Project at the Department of Computer Science and Information Engineering, NCKU, 2025.
Special thanks to Prof. Fan-Hsun Tseng for his guidance and supervision throughout the project.
All news articles are copyrighted by their original publishers and fact-check platforms. Please comply with their usage policies.



