This project implements a paraphrase generation model built on the T5 transformer. The model is fine-tuned on the PAWS dataset of over 49,000 labeled sentence pairs, using transfer learning to improve the accuracy and fluency of generated text. The pipeline covers data preprocessing, model fine-tuning, and performance evaluation, balancing training efficiency with generalization.
- Paraphrase Generation: Generates diverse, fluent paraphrases for input sentences (see the quick-start sketch after this list).
- Fine-tuned T5 Model: Trained using the PAWS dataset for high-quality paraphrasing.
- Optimized Training Process: Achieved a 20% reduction in fine-tuning time.
- NLP Pipeline: Covers data preprocessing, model training, optimization, and evaluation.
- Framework Compatibility: Built on standard NLP tooling, namely PyTorch and Hugging Face Transformers.
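As a quick-start illustration of the paraphrase-generation feature, the sketch below uses the Transformers `pipeline` helper. The checkpoint path `./t5-paws-paraphrase` and the `paraphrase:` input prefix are assumptions for illustration, not fixed conventions of this repo.

```python
from transformers import pipeline

# Assumed path to a fine-tuned checkpoint; point this at your own output directory.
paraphraser = pipeline("text2text-generation", model="./t5-paws-paraphrase")

sentence = "The quick brown fox jumps over the lazy dog."
# The "paraphrase: " task prefix is an assumed convention from fine-tuning.
outputs = paraphraser(
    f"paraphrase: {sentence}",
    max_length=64,
    num_beams=5,
    num_return_sequences=3,
)
for out in outputs:
    print(out["generated_text"])
```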
- Python
- PyTorch
- Hugging Face Transformers
- PAWS Dataset
- Google Colab / Jupyter Notebook
Training uses the PAWS (Paraphrase Adversaries from Word Scrambling) dataset, which:
- Contains 49,000+ labeled sentence pairs.
- Is designed to improve paraphrase generation by reducing lexical overlap biases.
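A minimal sketch of loading PAWS with the Hugging Face `datasets` library, assuming the Hub copy of the dataset (`paws`, `labeled_final` configuration); only pairs labeled as true paraphrases (`label == 1`) are useful as generation targets.

```python
from datasets import load_dataset

# Load the labeled Wikipedia portion of PAWS from the Hugging Face Hub (assumed dataset id).
paws = load_dataset("paws", "labeled_final")

# Keep only true paraphrase pairs (label == 1) for sequence-to-sequence training.
train_pairs = paws["train"].filter(lambda example: example["label"] == 1)

print(train_pairs[0]["sentence1"])
print("->", train_pairs[0]["sentence2"])
```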
- Subset Sampling: Used a 3,600-sample subset for initial fine-tuning to ensure efficient training while maintaining generalization.
- Accelerated Fine-Tuning: Reduced fine-tuning time by 20% through training-process optimizations.
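A hedged sketch of the fine-tuning step, assuming `t5-small`, a `paraphrase:` task prefix, and illustrative hyperparameters; the exact settings behind the 3,600-sample subset run and the reported 20% speed-up are not reproduced here.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "t5-small"  # assumed variant; larger T5 models are listed under future enhancements
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Paraphrase pairs only (label == 1); a 3,600-example subset mirrors the initial fine-tuning setup.
paws = load_dataset("paws", "labeled_final")
train_pairs = paws["train"].filter(lambda ex: ex["label"] == 1).select(range(3600))

def preprocess(batch):
    # "paraphrase: " is an assumed task prefix; the target is the paired paraphrase.
    model_inputs = tokenizer(["paraphrase: " + s for s in batch["sentence1"]],
                             max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["sentence2"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = train_pairs.map(preprocess, batched=True, remove_columns=train_pairs.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="./t5-paws-paraphrase",  # assumed output path, reused in the other examples
    per_device_train_batch_size=16,     # illustrative hyperparameters, not the project's exact values
    learning_rate=3e-4,
    num_train_epochs=3,
    fp16=True,                          # mixed precision is one common way to shorten training time
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./t5-paws-paraphrase")
```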
- Experiment with larger T5 variants (T5-base, T5-large) for improved performance.
- Implement beam search and top-k sampling for more diverse paraphrase generation (see the decoding sketch after this list).
- Fine-tune on additional datasets for domain-specific applications.
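For the decoding enhancement above, both beam search and top-k sampling are exposed through `model.generate`; the sketch below assumes the fine-tuned checkpoint path used in the earlier examples.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "./t5-paws-paraphrase"  # assumed path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("paraphrase: The weather is nice today.", return_tensors="pt")

# Beam search favours high-likelihood, more conservative paraphrases.
beam_ids = model.generate(**inputs, num_beams=5, num_return_sequences=3, max_length=64)

# Top-k sampling trades some fluency for greater diversity.
sample_ids = model.generate(**inputs, do_sample=True, top_k=50, num_return_sequences=3, max_length=64)

print("Beam search:")
for ids in beam_ids:
    print(" -", tokenizer.decode(ids, skip_special_tokens=True))

print("Top-k sampling:")
for ids in sample_ids:
    print(" -", tokenizer.decode(ids, skip_special_tokens=True))
```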
Contributions are welcome! Feel free to open issues or submit pull requests.
This project is licensed under the MIT License.
- Google Research for the PAWS dataset.
- Hugging Face for the Transformers library.
- PyTorch for providing an efficient deep learning framework.
📧 Email: utkarshranaa06@gmail.com
🔗 GitHub: utkarshranaa
🔗 LinkedIn: www.linkedin.com/in/utkarshranaa
🔗 X/Twitter: @utkarshranaa