This repository contains a complete pipeline for fine-tuning Google's BigBird transformer for NSFW (Not Safe For Work) text classification and deploying it as a live API on Hugging Face Spaces using Gradio. The solution is optimized for identifying inappropriate textual content in long-form documents — useful for child protection systems, content moderation, and safe browsing applications.
- ✅ Fine-tuned `google/bigbird-roberta-base` model for binary NSFW text classification.
- ✅ Supports long text inputs (up to 4096 tokens).
- ✅ Lightweight inference API with Gradio.
- ✅ Hosted on Hugging Face Spaces for public access.
- ✅ Google Colab training notebook included.
- ✅ Easily customizable for other text classification tasks.
- Architecture: BigBird-RoBERTa-base
- Task: Binary classification (Safe = 0, NSFW = 1)
- Input: Raw text paragraphs
- Output: 0 (Safe), 1 (Not Safe)
The model is trained on a labeled dataset (`NSFW1.csv`) consisting of text segments with corresponding binary labels indicating safety.
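The snippet below is a minimal sketch of how such a dataset might be inspected before training; the column names `text` and `label` are assumptions and may differ in the actual `NSFW1.csv`.

```python
import pandas as pd

# Load the labeled dataset; the column names "text" and "label" are an
# assumption about NSFW1.csv and may differ in the actual file.
df = pd.read_csv("NSFW1.csv")
print(df.shape)
print(df["label"].value_counts())  # 0 = Safe, 1 = NSFW
```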
To run the API locally:
```bash
git clone https://github.com/rusiru-erandaka/Bigbird_huggingface_deploy.git
cd Bigbird_huggingface_deploy
pip install -r requirements.txt
python app.py
```
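Once `app.py` is running, the local Gradio app can also be queried programmatically. The sketch below assumes Gradio's defaults (port 7860 and a single interface exposed at `/predict`); adjust it if `app.py` configures these differently.

```python
from gradio_client import Client

# Connect to the locally running Gradio app. The port and endpoint name
# assume Gradio defaults (a single gr.Interface exposed at /predict).
client = Client("http://127.0.0.1:7860/")
result = client.predict("Some text to check for NSFW content.", api_name="/predict")
print(result)
```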
Bigbird_huggingface_deploy/
- Bigbird2_.ipynb # Training notebook (Google Colab)
- app.py # Gradio API app
- config.json               # Model config
- tokenizer_config.json     # Tokenizer config
- special_tokens_map.json   # Tokenizer special tokens
- spiece.model # SentencePiece model
- requirements.txt # Python dependencies
- LICENSE
- Upload your fine-tuned model and tokenizer files to the Hugging Face Model Hub.
- Deploy `app.py` on Hugging Face Spaces using Gradio as the interface.
- Ensure that the `model_id` in `app.py` matches your model repository name, e.g., `"Rerandaka/Cild_safety_bigbird"`.
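For reference, the sketch below shows one way `app.py` could wrap the model in a Gradio interface; it is an illustrative approximation, not the exact code shipped in this repository.

```python
import gradio as gr
from transformers import pipeline

# Hub repository holding the fine-tuned model; replace with your own repo name.
model_id = "Rerandaka/Cild_safety_bigbird"
classifier = pipeline("text-classification", model=model_id)

def classify(text):
    # Return the predicted label and score (LABEL_0 = Safe, LABEL_1 = NSFW).
    result = classifier(text, truncation=True, max_length=4096)[0]
    return {result["label"]: result["score"]}

demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(lines=8, label="Text to classify"),
    outputs=gr.Label(label="Prediction"),
    title="BigBird NSFW Text Classifier",
)

if __name__ == "__main__":
    demo.launch()
```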
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Rerandaka/Cild_safety_bigbird")
classifier("This is an inappropriate message involving violence.")
# Output: [{'label': 'LABEL_1', 'score': 0.98}]
```
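The pipeline returns generic label names. A small helper like the one below (written here for illustration, assuming `LABEL_0` = Safe and `LABEL_1` = NSFW as described above) can translate the output into readable results.

```python
from transformers import pipeline

# Illustrative mapping from the pipeline's generic labels to readable names
# (assumes LABEL_0 = Safe and LABEL_1 = NSFW, as described above).
LABEL_NAMES = {"LABEL_0": "Safe", "LABEL_1": "NSFW"}

classifier = pipeline("text-classification", model="Rerandaka/Cild_safety_bigbird")
results = classifier("This is an inappropriate message involving violence.")
print([(LABEL_NAMES.get(r["label"], r["label"]), r["score"]) for r in results])
# e.g. [('NSFW', 0.98)]
```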
- Optimizer: AdamW
- Epochs: 3–5
- Evaluation Metrics: Accuracy, Precision, Recall
- Dataset Split: 80% Train / 20% Test
- Platform: Google Colab with GPU
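The following is a condensed, illustrative sketch of a fine-tuning run consistent with these settings. The `text`/`label` column names, batch size, and learning rate are placeholders; the training notebook `Bigbird2_.ipynb` is the authoritative reference.

```python
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Assumed columns: "text" and "label" (0 = Safe, 1 = NSFW).
df = pd.read_csv("NSFW1.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)  # 80/20 split

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")

def tokenize(batch):
    # BigBird's sparse attention handles sequences up to 4096 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

train_ds = Dataset.from_pandas(train_df, preserve_index=False).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df, preserve_index=False).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=2
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
    }

args = TrainingArguments(
    output_dir="bigbird-nsfw",
    num_train_epochs=3,             # 3-5 epochs were used
    per_device_train_batch_size=2,  # small batches; 4096-token inputs are memory-heavy
    learning_rate=2e-5,             # illustrative value, not taken from the notebook
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=DataCollatorWithPadding(tokenizer),  # Trainer's default optimizer is AdamW
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())
```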
- transformers
- torch
- gradio
This repository is licensed under the MIT License.
- Hugging Face 🤗 for Transformers and Spaces
- Google for BigBird-RoBERTa
- Gradio for interactive UI
- Rusiru Erandaka for fine-tuning and deployment