Chinese Fake News Detection based on MacBERT

A deep learning–powered system for probabilistic fake news detection in Chinese-language news

Overview

DeFake-ZH is a deep learning–based system for Chinese fake news detection, integrating MacBERT, MGP Database Matching, and Contradiction Detection.
Trained on over 390k news articles collected between September 2024 and May 2025, the system achieves 99.12% accuracy and an F1-score of 0.9912, higher than the results reported in the prior studies listed under Literature Comparison.

System Architecture

Figure: Overall system architecture of DeFake-ZH

News Crawler: collects verified and unverified articles from mainstream Taiwanese media and fact-checking organizations.
Data Preparation: splits each article into Title and Content fields for subsequent analysis.
Preprocessing: text cleaning, deduplication, Chinese tokenization, and embedding extraction.
Detection Modules:
- MacBERT + MLP Classifier: supervised classification using contextual embeddings.
- Database Matching: compares input with verified/fake entries in the MGP database.
- Contradiction Detection: checks logical contradictions via knowledge triplets and Natural Language Inference (NLI).
- LLM-Scored Features: headlines scored by LLaMA 3.1-8B-Instruct for sentiment, subjectivity, and extreme expressions. These features enhance interpretability and help detect implicit disinformation.
GUI: interactive interface for quick and full analysis, providing probability scores with expandable interpretability reports.
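
As a rough illustration of the Contradiction Detection module listed above, the sketch below pairs triplets from an input article with stored triplets that share a subject and predicate but disagree on the object. The `Triplet` type, the function name, and the toy data are all hypothetical. The project itself extracts triplets with LTP and checks contradictions with an NLI model, and neither step is reproduced here.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """A (subject, predicate, object) knowledge triplet. Hypothetical type."""
    subject: str
    predicate: str
    obj: str


def contradiction_candidates(claims, known_facts):
    """Return (claim, fact) pairs with the same subject and predicate
    but a different object. These are candidates only, not verdicts."""
    # Index known facts by (subject, predicate) for fast lookup.
    index = {}
    for fact in known_facts:
        index.setdefault((fact.subject, fact.predicate), []).append(fact)

    pairs = []
    for claim in claims:
        for fact in index.get((claim.subject, claim.predicate), []):
            if fact.obj != claim.obj:
                pairs.append((claim, fact))
    return pairs


# Invented toy data, for illustration only.
known = [Triplet("Company A", "headquartered_in", "Taipei")]
claims = [Triplet("Company A", "headquartered_in", "Tainan")]
candidates = contradiction_candidates(claims, known)
```

A differing object does not always mean a contradiction (two objects can both be true), which is presumably why the system follows triplet matching with an NLI check.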

Methodology

(a) Database Construction

Sources: see Dataset for the full list of news outlets and fact-checking references.
Articles: over 390k (verified + unverified).
Additional scoring: LLaMA 3.1-8B-Instruct is used to generate auxiliary labels for tone and sentiment.

(b) Data Preparation

Extract title and content.
Normalize and structure for downstream processing.

(c) Preprocessing

Text cleaning.
Duplicate removal.
Chinese word segmentation.
Embedding extraction via MacBERT.
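
The first two preprocessing steps can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the code in `dataPreparation.py`: the function names, the `title`/`content` dictionary keys, and the specific cleaning rules (NFKC normalization, tag stripping, whitespace collapsing) are illustrative choices. Word segmentation and MacBERT embedding extraction are not covered.

```python
import hashlib
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode forms, strip leftover HTML tags, collapse whitespace."""
    # NFKC folds full-width letters/digits and the ideographic space (U+3000)
    # into their ASCII counterparts. It also folds full-width punctuation
    # such as U+FF0C, which may or may not be wanted for Chinese text.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def deduplicate(articles):
    """Keep the first occurrence of each article, comparing cleaned text."""
    seen = set()
    unique = []
    for article in articles:
        title = clean_text(article["title"])
        content = clean_text(article["content"])
        # Hash the cleaned title and content so exact duplicates collide.
        key = hashlib.sha256(f"{title}\n{content}".encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        unique.append({"title": title, "content": content})
    return unique
```

Hashing catches exact duplicates after cleaning; near-duplicates (for example, the same wire story with a different byline) would need a similarity-based method instead.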

(d) Detection Modules

MacBERT Embedding + MLP Classifier – supervised classification using contextual embeddings.
Database Matching – checks for matches with known fake or verified news. A high similarity to a known entry is a strong signal for that entry's label.
Contradiction Detection – detects inconsistencies with known facts using triplet extraction / natural language inference (NLI).
LLM-Scored Features – evaluates sentiment, subjectivity, and provocative tone, displayed in GUI for interpretability.
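
To make the Database Matching idea concrete, here is a minimal sketch that compares a query vector against labeled reference vectors and reports the best match above a threshold. The 0.75 default echoes the threshold noted for `mgpSearch.py` in the project structure; the use of cosine similarity, the function names, and the toy three-dimensional vectors are assumptions. The project appears to rely on sentence embeddings and a FAISS index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec, entries, threshold=0.75):
    """Return (label, score) for the most similar entry at or above
    the threshold, or None when nothing is similar enough.

    `entries` is a list of (label, vector) pairs.
    """
    best = None
    for label, vec in entries:
        score = cosine_similarity(query_vec, vec)
        if score >= threshold and (best is None or score > best[1]):
            best = (label, score)
    return best


# Toy reference entries, invented for illustration.
reference = [("fake", [1.0, 0.0, 0.0]), ("verified", [0.0, 1.0, 0.0])]
match = best_match([0.9, 0.1, 0.0], reference)   # close to the "fake" entry
no_match = best_match([1.0, 1.0, 0.0], reference)  # equally far from both
```

Returning `None` below the threshold keeps a weak resemblance from being reported as a database hit, leaving the decision to the other modules.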

Experimental Results

Figure: Experimental results comparing two settings. (Left) MGP-only baseline. (Right) MGP+MacBERT.

System	Precision	Recall	F1-score	Accuracy
Only MGP	0.8421	0.7908	0.7827	0.7908
MGP + MacBERT	0.9913	0.9912	0.9912	0.9912

Only-MGP: on the fake-news class, precision is high (97.45%) but recall is low (59.72%), so many fake articles are missed; this imbalance lowers overall accuracy to 79.08%.
MGP+MacBERT: balanced across all four metrics, each above 99%.
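
The effect of the precision/recall gap can be checked with the usual F1 formula, F1 = 2PR / (P + R). The sketch below applies it to the fake-class figures quoted above; the function name is illustrative. Note that the table's Only-MGP recall equals its accuracy (0.7908), which is what a support-weighted average over classes produces, so the table rows are presumably weighted averages and will not match a single-class F1.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Fake-class figures quoted for the Only-MGP setting.
fake_f1_mgp_only = f1_score(0.9745, 0.5972)  # roughly 0.74

# Averaged figures from the MGP + MacBERT row of the table.
f1_combined = f1_score(0.9913, 0.9912)
```

Because the harmonic mean is pulled toward the smaller value, a recall near 0.60 caps the fake-class F1 at about 0.74 despite the 97% precision.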

Literature Comparison

Item	This project	[1]	[2]
Language	Chinese	Chinese	Chinese
Method	MacBERT	Keras Sequential	BERT
Accuracy	99.12%	90.53%	-
F1-score	0.9912	-	0.6701

The proposed system achieved an accuracy of 99.12% and an F1-score of 0.9912, both higher than the figures reported in [1] and [2].
In particular, [2] (Lin, 2021) applied a BERT-based model to binary classification of CoFacts data and reported a best-case Macro-F1 score of 0.6701.

Dataset

⚠️ Due to copyright restrictions, the dataset is not distributed with this repository.
Users may collect comparable data from the listed sources and fact-checking organizations, or contact us for further information regarding data access.

News Sources

Collected news data come from a variety of Taiwanese media outlets:

CNA, CTS, FTV, LTN, Mirror Media, PTS, SETN, TTV, TVBS, UDN

The dataset includes the following classification labels:

entertain_sports – Entertainment & Sports
international – International News
local_society – Local & Society
politics – Politics
technology_life – Technology & Life

Fact-checking References

DeFake-ZH is also informed by fact-checking resources from trusted Taiwanese organizations:

Project Structure

`data/`

data/
├── db/          # Reference DBs/indexes (.json/.pkl/.faiss)
├── features/    # Extracted feature tensors (.npy/.pt)
├── processed/   # Cleaned and split data (train/val/test)
└── raw/         # Raw CSV/JSON news data

`models/`

models/
├── bge-m3/                  # Sentence embeddings (BAAI/bge-m3)
├── chinese-macbert-large/   # Chinese MacBERT (hfl/chinese-macbert-large)
├── ltp/                     # HIT-SCIR LTP models
├── task/                    # Task-specific classifier
├── text2vec/                # Chinese sentence embeddings
├── word2vec/                # Chinese word embeddings
└── README.md

`src/`

src/
├── gui.py                # Gradio app entry point (Blocks UI & events)
├── interface.py          # Pipeline orchestrator (single-/multi-machine; helper clients)
├── otherGUI.py           # Shared GUI components & CSS hooks
├── scores.py             # Scoring & summary rendering
├── mgpSearch.py          # MGP database search (sentence similarity, OpenCC, threshold≈0.75)
├── contradiction.py      # Triplet extraction (LTP) + NLI contradiction checks (with caching)
├── classifier.py         # PyTorch MLP classifier + metrics
├── trainClassifier.py    # Training script
├── buildDatabase.py      # DB/index builder (merge, sentence split, triplets)
├── dataPreparation.py    # Cleaning, deduplication, tokenization, splits
├── featureEngineering.py # Embedding/feature pipelines (MacBERT/BGE/Text2Vec/Word2Vec)
├── PU_Learning.py        # PU learning: initial labeling with TF‑IDF + logistic regression
├── helper.py             # Helper server runner (contrad/MGP) for remote calls
├── const.py              # Global constants and canonical paths
└── nodes.py              # Embedding & vector store utils (FAISS, LangChain docs)

`tests/`

tests/
├── predict_single_news.py   # Single‑news prediction demo
├── testMLP.py               # Plot training/validation curves for MLP
└── testAll.py               # Analyze results: confusion matrices & PR/F1 across phases

Installation

1. Clone repository

git clone https://github.com/yachiashen/DeFake-ZH.git
cd DeFake-ZH
git lfs install
git lfs pull

Note: Git LFS

This repository uses Git LFS to manage large files (e.g., models, databases).
Make sure the Git LFS client is installed on your system, then run:
# Install Git LFS (only once per system)
git lfs install

# After cloning the repository, fetch large files
git lfs pull

2. Create environment

conda env update -f environment.yml --prune
conda activate defake-zh
pip install -r requirements.txt

3. Download / Prepare models

chinese-macbert-large: auto-download via Hugging Face transformers
Others (bge-m3, ltp, word2vec, text2vec, task): download manually as described in models/README.md

Ensure the models/ directory and its subfolders exist before running.
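
A quick pre-flight check along these lines can catch a missing folder before the GUI is launched. This is a sketch, not part of the repository: the function name is hypothetical, and the expected folder names are simply copied from the `models/` listing under Project Structure.

```python
from pathlib import Path

# Subfolder names taken from the models/ tree in Project Structure.
EXPECTED_MODEL_DIRS = [
    "bge-m3",
    "chinese-macbert-large",
    "ltp",
    "task",
    "text2vec",
    "word2vec",
]


def missing_model_dirs(models_root):
    """Return the expected subfolder names that do not exist under models_root."""
    root = Path(models_root)
    return [name for name in EXPECTED_MODEL_DIRS if not (root / name).is_dir()]


# Run from the repository root; prints nothing when every folder is present.
missing = missing_model_dirs("models")
if missing:
    print("Missing model folders:", ", ".join(missing))
```

The check only confirms that the folders exist; it does not verify that the model files inside them are complete.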

Usage

Launch Gradio interface

cd src
python gui.py

After launching, Gradio will display a local URL in the terminal (e.g., http://127.0.0.1:7860/).
Open it in your browser, enter a News Title and News Content, then choose Quick Analysis or Full Analysis.

The system outputs:
- Database matches
- Contradiction detection results
- MacBERT classification with sentence-level scores
- Final summary (expandable for detailed interpretability)

Or try it online (with limited functionality): DeFake-ZH on Hugging Face Spaces

Contributors

Note

This project was carried out as part of the Undergraduate Capstone Project at the Department of Computer Science and Information Engineering, NCKU, 2025.

Acknowledgements

Prof. Fan-Hsun Tseng, for his guidance and supervision throughout the project.

License

MIT License

All news articles are copyrighted by their original publishers and fact-check platforms. Please comply with their usage policies.
