Data-Analysis-Project

Avila Bible Copyist Classification

Fourth-year final project for the "Python for Data Analysis" class.

This project focuses on classifying the copyists of the Avila Bible, a 12th-century Latin Bible manuscript. The dataset, derived from 800 images of the manuscript, contains patterns with 10 features. The task is to associate each pattern with one of the 12 copyists, labeled as A, B, C, D, E, F, G, H, I, W, X, or Y.

Task Description

Dataset

The dataset has been normalized using Z-normalization and is divided into a training set with 10,430 samples and a test set with 10,437 samples. Each pattern corresponds to a group of four consecutive rows in the manuscript.
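As an illustration of what Z-normalization does to a single feature column, here is a minimal sketch using only the standard library. The sample values are hypothetical, and the use of the population standard deviation is an assumption: the dataset documentation does not say which variant was applied.

```python
from statistics import fmean, pstdev


def z_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical raw measurements for one feature (not taken from the Avila data).
raw_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_normalize(raw_column)
print([round(z, 2) for z in normalized])
```

Because the Avila files are already normalized, this step does not need to be repeated before modeling.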

Features

The 10 features include intercolumnar distance, upper margin, lower margin, and exploitation, among others.

Class Distribution (Training Set)

A: 4,286
B: 5
C: 103
D: 352
E: 1,095
F: 1,961
G: 446
H: 519
I: 831
W: 44
X: 522
Y: 266
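To put numbers on the imbalance, the counts listed above can be turned into class shares and a largest-to-smallest ratio. This is a small sketch with standard-library Python; the variable names are illustrative and not taken from the notebook.

```python
# Training-set counts copied from the list above.
class_counts = {
    "A": 4286, "B": 5, "C": 103, "D": 352, "E": 1095, "F": 1961,
    "G": 446, "H": 519, "I": 831, "W": 44, "X": 522, "Y": 266,
}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Identify the most and least represented copyists.
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)
imbalance_ratio = class_counts[largest] / class_counts[smallest]

# Print classes from most to least frequent.
for label, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{label}: {share:6.2%}")
print(f"total={total}, {largest}/{smallest} ratio={imbalance_ratio:.1f}")
```

The counts sum to the 10,430 training samples, and class A alone accounts for roughly 41% of them, which is why plain accuracy can be misleading on this dataset.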

Dependencies

List of dependencies required to run the project.

PIP

pip install matplotlib seaborn pandas numpy panel plotly xgboost scikit-learn shap

Conda

conda install -c conda-forge matplotlib seaborn pandas numpy panel plotly xgboost scikit-learn shap

Notebook

If you only want to read the notebook without going through the steps below, it is available here.
Note that the widgets from the Panel library will not be interactive in that view.

Task Progress

Data Pre-processing

According to the dataset documentation, the training and test sets were already split and Z-normalized. No data cleaning was needed: there were no missing values, and every column was already in the right format. The only transformation I applied was encoding the class labels as integers, which leaves the data ready for modeling.

Encoding

Class	Label
A	0
B	1
C	2
D	3
E	4
F	5
G	6
H	7
I	8
W	9
X	10
Y	11
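The mapping in the table can be reproduced with a plain dictionary. This is a sketch only: the README does not say whether the notebook used a manual mapping or a library encoder, and the helper names below are hypothetical. Sorting the class names alphabetically yields the same integers as the table.

```python
# Class names in alphabetical order, which matches the table above.
CLASSES = sorted(["A", "B", "C", "D", "E", "F", "G", "H", "I", "W", "X", "Y"])

# Forward and inverse lookups between copyist letter and integer label.
LABEL_OF = {name: index for index, name in enumerate(CLASSES)}
CLASS_OF = {index: name for name, index in LABEL_OF.items()}


def encode(names):
    """Convert copyist letters to integer labels."""
    return [LABEL_OF[name] for name in names]


def decode(labels):
    """Convert integer labels back to copyist letters."""
    return [CLASS_OF[label] for label in labels]


encoded = encode(["A", "I", "W", "Y"])
print(encoded)
print(decode(encoded))
```

Keeping the inverse mapping around is useful later, since a model trained on integer labels returns integers that have to be translated back into copyist letters.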

Some Data Visualizations

The bar plot of class counts makes the imbalance in the dataset obvious. The description provided by the paleographers already mentioned it, but a picture conveys it more directly.

I also studied how each feature is distributed across the dataset, broken down by class.

Modeling

After comparing several models, I narrowed the candidates down to the three best performers.

This plot shows the performance of the three classification models after hyperparameter tuning: XGBoost, Random Forest, and Bagging.
The metrics used to evaluate them are the Matthews Correlation Coefficient (MCC), F1 score, and precision.
All three models perform well on the dataset, with MCC scores above 0.8.
XGBoost outperforms the other two on all three metrics, so it is the model used for prediction.
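For readers unfamiliar with MCC, the sketch below computes the multiclass form of the coefficient in plain Python from true and predicted labels. It is not the notebook's code; the function name `multiclass_mcc` and the toy labels are illustrative. Returning 0.0 when the denominator is zero is a convention adopted here.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews Correlation Coefficient from two label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                       # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))       # correctly predicted
    t_k = Counter(y_true)                                 # true count per class
    p_k = Counter(y_pred)                                 # predicted count per class
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in labels)
    spread_true = s * s - sum(t_k[k] ** 2 for k in labels)
    spread_pred = s * s - sum(p_k[k] ** 2 for k in labels)
    denominator = sqrt(spread_true * spread_pred)
    # Zero denominator means one side uses a single class only.
    return numerator / denominator if denominator else 0.0


# Toy labels: 8 samples of a majority class 0 and 2 of a minority class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_majority = [0] * 10
perfect = list(y_true)

accuracy_majority = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(accuracy_majority, multiclass_mcc(y_true, always_majority))
print(multiclass_mcc(y_true, perfect))
```

The toy case shows why MCC suits an imbalanced dataset: a classifier that always predicts the majority class still gets 80% accuracy here, but its MCC is 0.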

API

API Endpoints

/: Home page
/prediction_result: Endpoint for getting predictions

Run this command in a terminal to start the Flask app:

python app.py
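Once the app is running, a request to the prediction endpoint could be prepared as sketched below. Several details are assumptions not confirmed by this README: that the server listens on Flask's default development address, that `/prediction_result` accepts a POST with form data, and that the ten features are sent under the hypothetical field names `f1` to `f10`. Check `app.py` and the templates for the real names.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: Flask's default development address has not been changed in app.py.
BASE_URL = "http://127.0.0.1:5000"


def build_prediction_request(features, base_url=BASE_URL):
    """Build (but do not send) a form POST for the /prediction_result endpoint."""
    if len(features) != 10:
        raise ValueError("expected exactly 10 feature values")
    # Hypothetical field names f1..f10; the real form fields may differ.
    form = {f"f{i}": value for i, value in enumerate(features, start=1)}
    body = urlencode(form).encode("utf-8")
    return Request(f"{base_url}/prediction_result", data=body, method="POST")


# Ten placeholder values standing in for one Z-normalized pattern.
request = build_prediction_request([0.0] * 10)
print(request.get_method(), request.full_url)
```

To actually send it, pass `request` to `urllib.request.urlopen` while the app is running.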

Conclusions

This project aimed to predict the copyist behind each pattern of the manuscript by applying machine learning to its page-layout features. The key outcomes and insights are summarized below:

1. Model Accuracy and Performance

The models identified the copyist of a pattern with promising accuracy. Using a combination of XGBoost and Random Forest classifiers, they reached an accuracy of approximately 85%. They were evaluated with several performance metrics, including precision, recall, and F1-score, which supports their reliability in telling the copyists apart.

2. Insights Gained

Exploratory data analysis and visualization uncovered several patterns and relationships among the features. These insights shed light on the distinctive characteristics of each copyist's work and show which features matter most in differentiating them.

Citations

C. De Stefano, M. Maniaci, F. Fontanella, A. Scotto di Freca, Reliable writer identification in medieval manuscripts through page layout features: The "Avila" Bible case, Engineering Applications of Artificial Intelligence, Volume 72, 2018, pp. 99-110.
