A project by <em>Mohan Krishna G R</em>, AI/ML Intern @ Infosys Springboard, Summer 2024.

## Contents
- [Problem Statement](#problem-statement)
- [Project Statement](#project-statement)
- [Approach to Solution](#approach-to-solution)
- [Background Research](#background-research)
- [Solution](#solution)
- [Workflow](#workflow)
- [Data Collection](#data-collection)
- [Abstractive Text Summarization](#abstractive-text-summarization)
- [Extractive Text Summarization](#extractive-text-summarization)
- [Testing](#testing)
- [Deployment](#deployment)
- [Containerization](#containerization)
- [CI/CD Pipeline](#cicd-pipeline)

## Problem Statement
- Developing an automated text summarization system that can accurately and efficiently condense large bodies of text into concise summaries is essential for enhancing business operations.
- This project aims to deploy NLP techniques to create a robust text summarization tool capable of handling various types of documents across different domains.
- The system should deliver high-quality summaries that retain the core information and contextual meaning of the original text.

## Project Statement
- Text summarization focuses on condensing a large body of text into a few sentences that sum up its gist.
- There is a wide variety of applications for text summarization, including news summaries, customer reviews, and research papers.
- This project aims to understand the importance of text summarization and apply different techniques to fulfill that purpose.

## Data Collection
- Data collection and preprocessing are implemented in `src/data_preprocessing`.
- Data was collected from different sources:
  - CNN/Daily Mail: news
  - BillSum: legal
  - ArXiv: scientific
  - DialogSum: conversations
- Data integration yields robust, multi-objective data spanning news articles, legal documents (acts and judgements), scientific papers, and conversations.
- Validated the data through descriptive statistics and exploratory data analysis (EDA), with frequency plots for every data source.
- Data cleansing optimized for NLP tasks: removal of null records, lowercasing, punctuation removal, stop-word removal, and lemmatization.
- Data was split with scikit-learn into training, testing, and validation sets, saved in CSV format (see the sketch below).
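A minimal sketch of the cleansing-and-splitting step, assuming NLTK for stop words and lemmatization; the file and column names (`merged_dataset.csv`, `text`, `summary`) are illustrative, not the project's actual ones:

```python
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, drop stop words, and lemmatize."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [LEMMATIZER.lemmatize(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

# Hypothetical merged dataset with `text` and `summary` columns.
df = pd.read_csv("merged_dataset.csv").dropna(subset=["text", "summary"])
df["text"] = df["text"].map(clean_text)

# 80/10/10 split into train, validation, and test sets, saved as CSV.
train_df, rest_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)
for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    part.to_csv(f"{name}.csv", index=False)
```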
## Abstractive Text Summarization
### Model Training & Evaluation
**Training:**
- Selected a transformer architecture for abstractive summarization: fine-tuning a pre-trained model.
- Chose Facebook's BART Large model for its performance metrics and efficient number of trainable parameters.

### Method 1: Native PyTorch
#### Model Training
Implemented in `model.ipynb`.

The pre-trained transformer model is retrained on our derived dataset. The native PyTorch approach is used, rather than the high-level training-pipeline API, to gain more control over model training and its parameters.

Outline:
- Setup & initialization
- Training loop
- Evaluation loop

Training loss = 1.3280

The fine-tuned model is saved under `saved_model/fine_tuned_bart` (containing `config.json`, among other files). A minimal sketch of the native training loop follows.
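This sketch assumes a `train_dataset` that yields tokenized tensors (`input_ids`, `attention_mask`, `labels`); the epoch count, batch size, and learning rate are illustrative:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BartForConditionalGeneration, BartTokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # train_dataset is assumed

model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # HF models return the loss when labels are given
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("saved_model/fine_tuned_bart")
tokenizer.save_pretrained("saved_model/fine_tuned_bart")
```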
#### Model Validation (Custom)
Implemented in `src/evaluation.ipynb`.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the performance metric best suited to evaluating a text summarizer.

Implemented a custom evaluation function that calculates ROUGE scores from the model's inference output; a sketch appears below.

Even though the model reached a very low training loss, it performed inconsistently in the validation and testing phases. A suspected tensor error during Method 1 training could account for the inconsistency of the model's output.
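A minimal sketch of such a custom evaluation, using the `rouge-score` package (not necessarily the project's exact implementation):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def average_rouge(references, predictions):
    """Average ROUGE F-measures over paired reference/prediction summaries."""
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref, pred in zip(references, predictions):
        scores = scorer.score(ref, pred)
        for key in totals:
            totals[key] += scores[key].fmeasure
    return {key: total / len(references) for key, total in totals.items()}
```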
### Method 2: Trainer API
#### Model Training
Implemented in `src/bart.ipynb`.

A function was implemented to convert the text data into model inputs and targets. The `Trainer` class from the `transformers` package was used for training and evaluation; `Trainer` is a simple but feature-complete training and evaluation loop for PyTorch, optimized for transformers (a sketch follows below).

The model was trained on the whole dataset for 10 epochs, taking 26:24:22 (HH:MM:SS) over 125,420 steps.

Training loss = 0.1747
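A minimal sketch of the `Trainer` setup, assuming tokenized `train_dataset` and `eval_dataset`; the hyperparameter values shown are illustrative:

```python
from transformers import (BartForConditionalGeneration, BartTokenizerFast,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")

args = TrainingArguments(
    output_dir="saved_model",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # assumed tokenized datasets
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("saved_model/fine_tuned_bart")
```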
#### Model Validation
Implemented in `src/rouge.ipynb`.

The ROUGE score is again used as the performance metric for model evaluation.

Interestingly, this model performs consistently and robustly, with a ROUGE-1 score of 61.3224, in line with benchmark standards:
- Significantly higher than typical scores reported for state-of-the-art models on common datasets.
- Close to GPT-4's reported ROUGE-1 (F-measure) of 63.22 for text summarization.
- Comparative analysis showed significant improvement in performance after fine-tuning. Source code: `src/compare.ipynb`
### Model Selection for Deployment
Considered the performance metrics of the models trained by the two aforementioned methods. After due analysis, the model trained using Method 2 was selected for further deployment; inference with it looks roughly as sketched below.
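A minimal inference sketch with the selected model; the generation parameters are illustrative:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("saved_model/fine_tuned_bart")
model = BartForConditionalGeneration.from_pretrained("saved_model/fine_tuned_bart")

def summarize(text: str) -> str:
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(**inputs, num_beams=4, max_length=150, early_stopping=True)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```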
## Extractive Text Summarization
- Rather than a computationally intensive deep-learning model, a rule-based approach gives an efficient solution here: a novel combination of the TF-IDF matrix with K-Means clustering (see the sketch after this list).
- This expands topic modeling to multiple lower-level specialized entities (i.e., groups) embedded in a single document, operating at the individual-document and cluster level.
- The sentence closest to each cluster centroid (by Euclidean distance) is selected as the representative sentence for that cluster.
- **Implementation:** preprocess the text, extract features using TF-IDF, and summarize by selecting representative sentences.
- Source code for implementation & evaluation: `src/Extractive_Summarization.ipynb`
- ROUGE-1 (F-measure) = 24.71
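A minimal sketch of this TF-IDF + K-Means extractive summarizer; the sentence tokenizer and the cluster count are illustrative choices:

```python
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")

def extractive_summary(document: str, n_clusters: int = 3) -> str:
    sentences = sent_tokenize(document)
    if len(sentences) <= n_clusters:
        return document
    # Each sentence becomes one row of the TF-IDF matrix.
    matrix = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(matrix)
    picked = []
    for label in range(n_clusters):
        members = np.where(kmeans.labels_ == label)[0]
        # Representative sentence: closest to the centroid by Euclidean distance.
        dists = np.linalg.norm(matrix[members].toarray() - kmeans.cluster_centers_[label], axis=1)
        picked.append(members[np.argmin(dists)])
    return " ".join(sentences[i] for i in sorted(picked))
```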
## Testing
- Implemented a text summarization application using the Gradio library, providing a web-based interface for testing the model's inference.
- Gradio is an open-source Python package that lets us quickly build a demo web application for the trained models; for this initial phase after model validation, it is well suited to our objective.
- Implemented in `src/interface.ipynb`. A sketch of the interface follows.
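A minimal sketch of the Gradio test interface; the generation parameters are illustrative:

```python
import gradio as gr
from transformers import pipeline

summarizer = pipeline("summarization", model="saved_model/fine_tuned_bart")

def summarize(text: str) -> str:
    return summarizer(text, max_length=150, min_length=30, truncation=True)[0]["summary_text"]

demo = gr.Interface(fn=summarize, inputs=gr.Textbox(lines=15), outputs="text",
                    title="Text Summarizer")
demo.launch()
```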
## Deployment (Ongoing)
Utilized FastAPI for backend development, the fine-tuned transformer (`summarizer/saved_model`) for text summarization, and Docker for deployment, resulting in a robust and scalable text summarization application capable of handling various input sources and generating concise summaries efficiently.

Implemented extractor modules to handle various input sources (URL, PDF, DOCX, TXT).
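A minimal sketch of the FastAPI backend; the `/summarize` route and request model are assumptions rather than the project's exact API (run it with, e.g., `uvicorn app:app`):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI(title="Text Summarizer")
summarizer = pipeline("summarization", model="summarizer/saved_model")

class SummarizeRequest(BaseModel):
    text: str

@app.post("/summarize")
def summarize(req: SummarizeRequest) -> dict:
    result = summarizer(req.text, max_length=150, min_length=30, truncation=True)
    return {"summary": result[0]["summary_text"]}
```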
Thank you for your interest in our project! We welcome any feedback. Feel free to reach out to us.