Bankruptcy Prediction using data from 10-K reports

We propose a multimodal approach designed to handle sparse data in the task of bankruptcy prediction.

⚠️ Key Challenges Recap

Challenge	Description
📊 Severe Class Imbalance	Next-year bankruptcies are rare (~0.45%), making supervised learning fragile
🧮 Tabular Data with Missing Values	Missing-by-design or structural NaNs must be handled appropriately
📝 Sparse Text Modalities	Two sources of text (e.g., MD&A and Auditor comments) are high-dimensional and sparse
🔄 Multimodal Fusion	Combining heterogeneous sources (tabular + multiple texts) is nontrivial
🧪 Limited Labeled Data	Labels may be unreliable, noisy, or unavailable for certain years or firms
📈 Need for Robust Generalization	Overfitting on the majority class is easy; detecting subtle minority patterns is hard
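
One common way to counter the severe class imbalance listed above is to up-weight the rare positive class. The sketch below computes the negative-to-positive ratio, the kind of value usually passed to XGBoost's `scale_pos_weight` parameter. The helper name `pos_weight` is hypothetical, the counts are the training figures reported in this README's time-based split, and nothing here is confirmed to be what the repository itself does.

```python
def pos_weight(n_samples: int, n_positive: int) -> float:
    """Return the ratio of negative to positive samples (hypothetical helper)."""
    if n_positive <= 0 or n_positive > n_samples:
        raise ValueError("n_positive must be in the range 1..n_samples")
    return (n_samples - n_positive) / n_positive


# Training counts taken from the time-based split table in this README.
train_weight = pos_weight(n_samples=29752, n_positive=126)

# Share of bankrupt samples in the training split.
train_rate = 126 / 29752

print(f"weight for the positive class: {train_weight:.2f}")
print(f"training bankruptcy rate: {train_rate:.4%}")
```

A weight of this size makes each bankrupt sample count roughly as much as a few hundred healthy ones in the loss, so it is worth checking on the validation set rather than applying blindly.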

Time-based Data Split
Training samples (from 1999-12-31 to 2017-12-31)	29752
Validation samples (from 2018-01-01 to 2018-12-31)	2908
Test samples (from 2019-01-01 to 2021-10-31)	6381
Training bankrupt samples	126
Validation bankrupt samples	19
Test bankrupt samples	47
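
As an illustration of the split above, the following sketch partitions a table by date using the boundaries listed in this README. The column names `filing_date` and `bankrupt` and the function name `time_split` are assumptions made for the example; the actual logic in `load_data.py` may differ.

```python
import pandas as pd

# Inclusive date boundaries copied from the split table above.
SPLITS = {
    "train": ("1999-12-31", "2017-12-31"),
    "val": ("2018-01-01", "2018-12-31"),
    "test": ("2019-01-01", "2021-10-31"),
}


def time_split(df: pd.DataFrame, date_col: str = "filing_date") -> dict:
    """Split df into train/val/test by date (column name is an assumption)."""
    dates = pd.to_datetime(df[date_col])
    parts = {}
    for name, (start, end) in SPLITS.items():
        # Series.between is inclusive on both ends by default.
        mask = dates.between(pd.Timestamp(start), pd.Timestamp(end))
        parts[name] = df[mask]
    return parts


# Toy data, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "filing_date": ["2005-03-31", "2017-12-31", "2018-06-30",
                        "2019-01-01", "2021-10-31"],
        "bankrupt": [0, 1, 0, 0, 1],
    }
)
parts = time_split(toy)
sizes = {name: len(part) for name, part in parts.items()}
print(sizes)
```

Splitting by date rather than at random keeps future filings out of the training set, which matters when the goal is to predict next-year bankruptcies.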

Financial Dataset Info
Number of features	9700
Mean number of non-NaN features per sample	210.37
Std of number of non-NaN features per sample	93.93
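
The per-sample statistics above can be reproduced along the following lines. This is a minimal sketch on toy data: the frame `toy` is invented for the example, and the README does not say whether the reported standard deviation is the sample or population version (pandas defaults to the sample version, `ddof=1`).

```python
import numpy as np
import pandas as pd

# Toy feature table with structural NaNs; values are made up.
toy = pd.DataFrame(
    {
        "Assets": [1.0, np.nan, 3.0],
        "Goodwill": [np.nan, np.nan, 2.0],
        "Liabilities": [0.5, 1.5, np.nan],
    }
)

# Count the populated (non-NaN) features in each row.
non_nan_per_sample = toy.notna().sum(axis=1)

mean_non_nan = non_nan_per_sample.mean()
std_non_nan = non_nan_per_sample.std()  # sample std (ddof=1)

print(non_nan_per_sample.tolist(), mean_non_nan, std_non_nan)
```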

Usage:

git clone
Move mda_auditor.csv, numerical.csv, subset_mda_auditor.csv and subset_numerical.csv from /home/andreasideras/research_ideas/BankruptcyPredictionAD/data/ to the /data folder. The subset files are smaller, dense datasets for quick experimentation (use --subset True). You can use either a stratified split or a time-based split on the bankrupt cases by setting --split to one of {'stratified', 'time'}.
You can run any modality by giving the argument --modality a value from {financial, mda_auditor, multimodal}, for example:

python train.py --model_type XGBoost --modality multimodal --n_estimators 100 --max_depth 10 --subset False --tfidf_dim 1000

This trains the model and saves the validation set performance and the hyperparameters used in the hyperparameters folder. A model is saved only when validation performance improves, as measured by the metric set in config.VAL_METRIC_TO_OPTIMIZE.
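
The save-on-improvement behaviour described above can be sketched as follows. All names here (`should_save`, `val_scores`, `saved_at`) are hypothetical, the scores are invented, and the sketch assumes a metric where higher is better; the actual checkpointing code in `train.py` may be organised differently.

```python
from typing import Optional


def should_save(current: float, best: Optional[float]) -> bool:
    """Return True when the current validation score beats the best so far."""
    return best is None or current > best


# Invented validation scores from four successive training runs.
val_scores = [0.61, 0.58, 0.66, 0.66]

best: Optional[float] = None
saved_at = []  # indices of the runs that would trigger a save

for run, score in enumerate(val_scores):
    if should_save(score, best):
        best = score
        saved_at.append(run)  # a real script would persist the model here

print(saved_at, best)
```

A strict comparison means a run that only ties the best score does not overwrite the saved model.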
Test the (best) saved model on the test set:

python test.py --model_type XGBoost --modality multimodal --subset False

To implement your own model, extend the models.base_model.BaseModel class. If preprocessing is required (e.g., tf-idf extraction; see preprocessing.tfidf_preprocessor.TFIDFpreprocessor), create your own preprocessor by extending the preprocessing.base_preprocessor.BasePreprocessor class. Otherwise, use preprocessing.pass_preprocessor.PASSpreprocessor.
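
To give a feel for the extension pattern described above, here is a self-contained sketch. It does not import the repository's classes: `BasePreprocessorSketch` is a stand-in written for this example, and its `fit`/`transform` methods are an assumption about what such an interface might look like. Check `preprocessing/base_preprocessor.py` for the methods the real `BasePreprocessor` requires.

```python
from abc import ABC, abstractmethod
from typing import List


class BasePreprocessorSketch(ABC):
    """Stand-in base class; NOT the repository's BasePreprocessor."""

    @abstractmethod
    def fit(self, texts: List[str]) -> "BasePreprocessorSketch":
        """Learn any state needed from the training texts."""

    @abstractmethod
    def transform(self, texts: List[str]) -> List[str]:
        """Apply the preprocessing to the given texts."""


class LowercasePreprocessor(BasePreprocessorSketch):
    """Hypothetical example: a stateless preprocessor that lowercases text."""

    def fit(self, texts: List[str]) -> "LowercasePreprocessor":
        return self  # nothing to learn

    def transform(self, texts: List[str]) -> List[str]:
        return [text.lower() for text in texts]


processed = LowercasePreprocessor().fit(["unused"]).transform(["MD&A Text"])
print(processed)
```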

Density Percentage of the Top 40 Densest Features
CashAndCashEquivalentsAtCarryingValue	92.3%
IncomeTaxExpenseBenefit	88.6%
StockholdersEquity	86.5%
Assets	86.2%
LiabilitiesAndStockholdersEquity	85.6%
NetIncomeLoss	84.7%
NetCashProvidedByUsedInFinancingActivities	81.4%
NetCashProvidedByUsedInInvestingActivities	81.4%
NetCashProvidedByUsedInOperatingActivities	81.2%
RetainedEarningsAccumulatedDeficit	80.3%
EarningsPerShareBasic	80.2%
WeightedAverageNumberOfSharesOutstandingBasic	79.6%
EarningsPerShareDiluted	79.5%
DeferredIncomeTaxExpenseBenefit	79.3%
WeightedAverageNumberOfDilutedSharesOutstanding	78.8%
ComprehensiveIncomeNetOfTax	76.6%
PropertyPlantAndEquipmentNet	76.4%
ShareBasedCompensation	75.4%
AccumulatedOtherComprehensiveIncomeLossNetOfTax	75.3%
CommonStockSharesAuthorized	72.8%
InterestExpense	71.1%
CommonStockValue	69.2%
AccumulatedDepreciationDepletionAndAmortizationPropertyPlantAndEquipment	69.0%
CommonStockSharesIssued	68.9%
CashAndCashEquivalentsPeriodIncreaseDecrease	68.3%
CommonStockParOrStatedValuePerShare	67.2%
PaymentsToAcquirePropertyPlantAndEquipment	67.1%
CurrentFederalTaxExpenseBenefit	66.6%
OperatingIncomeLoss	65.9%
Liabilities	65.8%
UnrecognizedTaxBenefits	65.0%
PropertyPlantAndEquipmentGross	64.7%
Goodwill	64.6%
DeferredFederalIncomeTaxExpenseBenefit	64.4%
CurrentStateAndLocalTaxExpenseBenefit	64.4%
CurrentIncomeTaxExpenseBenefit	63.8%
EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate	61.7%
LiabilitiesCurrent	60.0%
AssetsCurrent	60.0%
CommonStockSharesOutstanding	59.8%
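
A density ranking like the one above could be computed along these lines. This is a sketch on invented values: the helper name `densest_features` is hypothetical, and density is assumed to mean the percentage of samples with a non-NaN value for a feature.

```python
import numpy as np
import pandas as pd


def densest_features(df: pd.DataFrame, top_n: int = 40) -> pd.Series:
    """Return the top_n features ranked by percentage of non-NaN values."""
    density = df.notna().mean() * 100  # share of populated rows per column
    return density.sort_values(ascending=False).head(top_n)


# Toy table; the column names are borrowed from the list above, the values are made up.
toy = pd.DataFrame(
    {
        "Goodwill": [np.nan, np.nan, 1.0, np.nan],
        "Assets": [1.0, 2.0, 3.0, np.nan],
        "CashAndCashEquivalentsAtCarryingValue": [1.0, 1.0, 1.0, 1.0],
    }
)

top = densest_features(toy, top_n=2)
print(top)
```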

iit-Demokritos/MABADforBankruptcyPrediction
