
Commit da3fbb3: Update with CI/CD
1 parent 36ffbc8

1 file changed: README.md (+190 −91 lines)
![Docker Image Size](https://img.shields.io/docker/image-size/mohankrishnagr/infosys_text-summarization/final)

# Infosys_Text-Summarization
A project by <em>Mohan Krishna G R</em>, AI/ML Intern @ Infosys Springboard, Summer 2024.

## Contents
- [Problem Statement](#problem-statement)
- [Project Statement](#project-statement)
- [Approach to Solution](#approach-to-solution)
- [Background Research](#background-research)
- [Solution](#solution)
- [Workflow](#workflow)
- [Data Collection](#data-collection)
- [Abstractive Text Summarization](#abstractive-text-summarization)
- [Extractive Text Summarization](#extractive-text-summarization)
- [Testing](#testing)
- [Deployment](#deployment)
- [Containerization](#containerization)
- [CI/CD Pipeline](#cicd-pipeline)

## Problem Statement
- Developing an automated text summarization system that can accurately and efficiently condense large bodies of text into concise summaries is essential for enhancing business operations.
- This project aims to deploy NLP techniques to create a robust text summarization tool capable of handling various types of documents across different domains.
- The system should deliver high-quality summaries that retain the core information and contextual meaning of the original text.

## Project Statement
- Text summarization focuses on converting large bodies of text into a few sentences that sum up the gist of the larger text.
- There is a wide variety of applications for text summarization, including news summaries, customer reviews, and research papers.
- This project aims to understand the importance of text summarization and to apply different techniques to fulfill that purpose.

## Approach to Solution
- **Figure:** Intended Plan
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1cn429WFQzvF1eDwEFiLsIk87M8KBELt8&export=download" border="0"></a>
</div>

## Background Research
- **Literature Review**
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1201kWfyGURgsA32u6Xe_WPSrO0izo8Fg&export=download" border="0"></a>
</div>

## Solution
- **Selected Deep Learning Architecture**

## Workflow
- Workflow for Abstractive Text Summarizer:
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1-smea28F10cOnmXXUj24QkzEZL-ffhWt&export=download" border="0"></a>
</div><br>

- Workflow for Extractive Text Summarizer:
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1vS2Gm5ccJvjxH7fsnyOf3ARk2pNTR75p&export=download" border="0"></a>
</div>

## Data Collection
- Data collection and pre-processing implemented in `src/data_preprocessing`.
- Data collected from different sources:
  - CNN, Daily Mail: News
  - BillSum: Legal
  - ArXiv: Scientific
  - Dialoguesum: Conversations
- Data integration ensures robust, multi-objective data spanning news articles, legal documents (acts and judgements), scientific papers, and conversations.
- Validated the data through summary statistics and exploratory data analysis (EDA), with frequency plots for every data source.
- Data cleansing optimized for NLP tasks: removing null records, lowercasing, punctuation removal, stop-word removal, and lemmatization.
- Data splitting with scikit-learn into training, validation, and test sets, saved in CSV format (see the sketch below).
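
A minimal sketch of the cleansing and splitting pipeline described above, assuming the merged data sits in `merged_dataset.csv` with `text`/`summary` columns (illustrative names, not necessarily those used in `src/data_preprocessing`):

```python
import string

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, drop stop words, lemmatize."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [LEMMATIZER.lemmatize(t) for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

# File and column names are illustrative assumptions
df = pd.read_csv("merged_dataset.csv").dropna(subset=["text", "summary"])
df["text"] = df["text"].apply(clean_text)

# 80/10/10 train/validation/test split, saved as CSV
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
train_df.to_csv("train.csv", index=False)
val_df.to_csv("val.csv", index=False)
test_df.to_csv("test.csv", index=False)
```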

## Abstractive Text Summarization
### Model Training & Evaluation
- **Training:**
  - Selected a transformer architecture for abstractive summarization: fine-tuning a pre-trained model.
  - Chose Facebook's BART-large model for its performance metrics and efficient use of trainable parameters.
  - 406,291,456 trainable parameters.

<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1fe7MMx_-kEAN9c0QVJbsMj9dBUNgEZX8&export=download" border="0"></a>
</div><br>

- **Methods:**
  - Native PyTorch implementation
  - Trainer API implementation

### Method 1 – Native PyTorch
- Trained the model with a manual training loop and evaluation loop in PyTorch (see the sketch below). Implemented in `src/model.ipynb`. Training loss = 1.3280.
- **Model Evaluation:** source code in `src/evaluation.ipynb`.
  - Obtained inconsistent results at inference despite the low training loss.
  - ROUGE1 (F-Measure) = 0.018
  - A suspected tensor error during training may account for the inconsistency of the model's output.
  - Rejected for further deployment; an alternative approach was needed.
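
A minimal sketch of a native PyTorch fine-tuning loop of this kind; the checkpoint, optimizer settings, and batch layout are assumptions, not the exact contents of `src/model.ipynb`:

```python
import torch
from torch.utils.data import DataLoader
from transformers import BartForConditionalGeneration, BartTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def run_epoch(loader: DataLoader, train: bool = True) -> float:
    """One pass over the data; returns the mean loss."""
    model.train() if train else model.eval()
    total = 0.0
    for batch in loader:  # batches of input_ids / attention_mask / labels
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.set_grad_enabled(train):
            out = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"])
        if train:
            optimizer.zero_grad()
            out.loss.backward()
            optimizer.step()
        total += out.loss.item()
    return total / len(loader)
```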

### Method 2 – Trainer Class Implementation
- Utilized the Trainer API from Hugging Face for optimized transformer model training (see the sketch below). Implemented in `src/bart.ipynb`.
- The model was trained on the whole dataset for 10 epochs over 125,420 steps, taking 26:24:22 (HH:MM:SS). Training loss = 0.1747.

- **Evaluation:** performance metrics using ROUGE scores. Source code: `src/rouge.ipynb`
  - Method 2's results outperformed those of Method 1.
  - <strong>ROUGE1 (F-Measure) = 61.32</strong> → benchmark grade
  - Significantly higher than typical scores reported for state-of-the-art models on common datasets.
  - For reference, GPT-4's ROUGE1 (F-Measure) for text summarization is 63.22.
  - Selected for further deployment.
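
A minimal sketch of Trainer-based fine-tuning as described above, reusing the CSV splits from data collection; hyperparameter values are illustrative, not the exact settings in `src/bart.ipynb`:

```python
from datasets import load_dataset
from transformers import (BartForConditionalGeneration, BartTokenizer,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Tokenize the CSV splits produced during data collection (paths assumed)
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=1024)
    enc["labels"] = tokenizer(text_target=batch["summary"], truncation=True,
                              max_length=128)["input_ids"]
    return enc

data = data.map(tokenize, batched=True, remove_columns=["text", "summary"])

args = TrainingArguments(
    output_dir="saved_model",
    num_train_epochs=10,          # matches the 10-epoch run described above
    per_device_train_batch_size=4,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("saved_model/fine_tuned_bart")
```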

- Comparative analysis showed a significant improvement in performance after fine-tuning (see the scoring sketch below). Source code: `src/compare.ipynb`
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1V4u8ohFNFcceidx3l43LNjxTbtLZ233g&export=download" border="0"></a>
</div><br>
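
A minimal sketch of ROUGE scoring of this kind, assuming Hugging Face's `evaluate` package; `src/rouge.ipynb` and `src/compare.ipynb` may compute the metrics differently:

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the model generated summary"],   # illustrative strings
    references=["the reference summary"],
)
print(scores["rouge1"])  # ROUGE-1 F-measure in [0, 1]
```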

## Extractive Text Summarization
- Rather than a computationally intensive deep-learning model, a rule-based approach yields an efficient solution here: a novel combination of the TF-IDF matrix with KMeans clustering.
- This expands topic modeling to the multiple lower-level entities (i.e., sentence groups) embedded in a single document, operating at the individual document and cluster level.
- The sentence closest to each cluster centroid (by Euclidean distance) is selected as the representative sentence for that cluster (see the sketch below).
- **Implementation:** preprocess the text, extract TF-IDF features, and summarize by selecting representative sentences.
- Source code for implementation & evaluation: `src/Extractive_Summarization.ipynb`
- ROUGE1 (F-Measure) = 24.71
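
A minimal sketch of the TF-IDF + KMeans selection described above; the sentence tokenizer and fixed cluster count are assumptions, not the exact logic of `src/Extractive_Summarization.ipynb`:

```python
import numpy as np
import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")

def extractive_summary(text: str, n_clusters: int = 3) -> str:
    """Pick the sentence nearest each KMeans centroid in TF-IDF space."""
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= n_clusters:
        return text
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(tfidf)
    picks = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Euclidean distance from each member sentence to its centroid
        dists = np.linalg.norm(tfidf[members].toarray() - km.cluster_centers_[c], axis=1)
        picks.append(members[int(np.argmin(dists))])
    return " ".join(sentences[i] for i in sorted(picks))
```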

## Testing
- Implemented a text summarization application with the Gradio library, providing a web-based interface for testing the model's inference (a minimal harness is sketched below).
- **Source Code:** `src/interface.ipynb`
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1YSNYZJl25zKHkOSJl7suxty3wy-cyUMe&export=download" border="0"></a>
</div><br>
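
A minimal Gradio harness of this kind, assuming the fine-tuned checkpoint saved by Method 2; the model path and generation settings are illustrative:

```python
import gradio as gr
from transformers import pipeline

# Path assumed from the Trainer sketch above
summarizer = pipeline("summarization", model="saved_model/fine_tuned_bart")

def summarize(text: str) -> str:
    return summarizer(text, max_length=128, min_length=30, truncation=True)[0]["summary_text"]

gr.Interface(fn=summarize, inputs=gr.Textbox(lines=12), outputs="text",
             title="Text Summarizer").launch()
```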

## Deployment
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1mvbC3IZzRxS0Hx0DoO6EvrQyKqgD--Gw&export=download" border="0"></a>
</div><br>

### Application
- **File Structure:** `summarizer/`
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1OnHuW8YMPQYT88pqPbWAZCpgEFlic0kw&export=download" width="320" height="320" border="0"></a>
</div><br>

### API Endpoints
- Developed using the FastAPI framework for handling URLs, files, and direct text input; a minimal sketch follows this list.
- **Source Code:** `summarizer/app.py`
- **Endpoints:**
  - Root Endpoint
  - Summarize URL
  - Summarize File
  - Summarize Text
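
A minimal sketch of such endpoints; route paths, helper names, and the model path are assumptions, not the exact API of `summarizer/app.py`:

```python
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
summarizer = pipeline("summarization", model="saved_model/fine_tuned_bart")

class TextIn(BaseModel):
    text: str

def abstract(text: str) -> dict:
    out = summarizer(text, max_length=128, min_length=30, truncation=True)
    return {"summary": out[0]["summary_text"]}

@app.get("/")
def root():
    """Root endpoint: basic health check."""
    return {"status": "ok"}

@app.post("/summarize/text")
def summarize_text(payload: TextIn):
    return abstract(payload.text)

@app.post("/summarize/file")
async def summarize_file(file: UploadFile):
    # A real app would route PDF/DOCX uploads through the extractor modules below;
    # a URL route would likewise fetch and extract the page text first.
    raw = (await file.read()).decode("utf-8", errors="ignore")
    return abstract(raw)

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```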

### Extractor Modules
- Extract text from various sources (URLs, PDF, DOCX) using BeautifulSoup and fitz (PyMuPDF); see the sketch below.
- **Source Code:** `summarizer/extractors.py`
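
A minimal sketch of extractors like these, assuming `requests`, `beautifulsoup4`, `PyMuPDF` (`fitz`), and `python-docx` are installed; function names are illustrative, not the exact contents of `summarizer/extractors.py`:

```python
import fitz  # PyMuPDF
import requests
from bs4 import BeautifulSoup
from docx import Document

def extract_url(url: str) -> str:
    """Fetch a page and keep its paragraph text."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

def extract_pdf(path: str) -> str:
    """Concatenate the text of every PDF page."""
    with fitz.open(path) as doc:
        return " ".join(page.get_text() for page in doc)

def extract_docx(path: str) -> str:
    """Join all paragraphs of a DOCX document."""
    return "\n".join(p.text for p in Document(path).paragraphs)
```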

### Extractive Summary Script
- Implemented the extractive summarizer module, mirroring the approach in `src/Extractive_Summarization.ipynb`.
- **Source Code:** `summarizer/extractive_summary.py`

### User Interface
- Developed a user-friendly interface using HTML, CSS, and JavaScript.
- **Source Code:** `summarizer/templates/index.html`
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1EydlT7J-pZF4bgmLHD2isQCsp-_TS3uE&export=download" border="0"></a>
</div><br>

## Containerization
- Developed a Dockerfile to build a Docker image for the FastAPI application.
- **Source Code:** `summarizer/Dockerfile`
- **Image:** [Docker Image](https://hub.docker.com/layers/mohankrishnagr/infosys_text-summarization/group/images/sha256-28802ba2a3b30d36b94fbd878c97585c02c813534fc80fdca5e81494b96bfd08?context=explore)

## CI/CD Pipeline
- Developed a CI/CD pipeline using Docker, Azure, and GitHub Actions.
- Utilized Azure Container Instances (ACI) for deploying the image; the pipeline triggers on every push to the main branch.
- **Source Code:**
  - `.github/workflows/main.yml` (AWS)
  - `.github/workflows/azure.yml` (Azure)
- To use the Docker image, run:
```
docker pull mohankrishnagr/infosys_text-summarization:final
docker run -p 8000:8000 mohankrishnagr/infosys_text-summarization:final
```
Then check out:
```
http://localhost:8000/
```

### Deployed in AWS EC2 (not recommended under the free trial)
Public IPv4:
```
http://54.168.82.95/
```

### Deployed in Azure Container Instance (recommended)
Public IPv4:
```
http://20.219.203.134:8000/
```
FQDN:
```
http://mohankrishnagr.centralindia.azurecontainer.io:8000/
```

- **Screenshots:**
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1m2OYe7u1fS4yulQLyxYUs7kjmpFa5RND&export=download" border="0"></a>
</div><br>
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1x2v1rZGpcZiAVHyGDy6pKRqrTSTtnPpm&export=download" border="0"></a>
</div><br>
<div align="center">
  <a><img src="https://drive.usercontent.google.com/u/0/uc?id=1xsUfXRTERjEUevk__bOo37on4hGUOSnT&export=download" border="0"></a>
</div><br>

----

### Deployment (Ongoing)
Utilizing FastAPI for the backend, the fine-tuned transformer (`summarizer/saved_model`) for summarization, and Docker for deployment yields a robust, scalable text summarization application that handles various input sources and generates concise summaries efficiently. Extractor modules handle URL, PDF, DOCX, and TXT inputs.

### End Note
Thank you for your interest in our project! We welcome any feedback. Feel free to reach out to us.
229+
