For a deeper look at the project, from motivation through data collection and model training to final results, see the project blog: https://www.amitkedia.com/project/67ce1a818013ee818192b171
Introduction: This project uses machine learning, deep learning, and Large Language Models (LLMs) to detect financial fraud. It builds on a dataset derived from financial filings submitted to the U.S. Securities and Exchange Commission (SEC) and aims to compare and improve AI models for identifying fraudulent financial activity.
Objective: The goal is to foster a collaborative platform where data scientists and researchers can develop, test, and improve AI models for detecting financial fraud.
Source: The dataset includes financial filings from 170 companies, split equally between those involved in fraudulent and non-fraudulent activities.
Structure: Each dataset entry contains details such as Central Index Key (CIK), filing year, company name, and a categorical indicator of fraud.
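As a rough illustration of the entry structure described above, the sketch below parses a few rows into typed records and counts the two classes. The column names (`cik`, `year`, `company`, `fraud`), the `yes`/`no` label values, and the sample rows are assumptions made for this example; check the actual dataset files for the real schema.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass
class FilingRecord:
    cik: str       # Central Index Key, kept as text so leading zeros survive
    year: int      # filing year
    company: str   # company name
    fraud: bool    # categorical fraud indicator mapped to a boolean


def load_records(csv_text: str) -> list[FilingRecord]:
    """Parse CSV text into FilingRecord objects (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append(
            FilingRecord(
                cik=row["cik"].strip(),
                year=int(row["year"]),
                company=row["company"].strip(),
                # Assumption: the label is stored as "yes" / "no".
                fraud=row["fraud"].strip().lower() == "yes",
            )
        )
    return records


def class_balance(records: list[FilingRecord]) -> Counter:
    """Count fraudulent vs. non-fraudulent entries."""
    return Counter(r.fraud for r in records)


# Made-up rows for illustration only; these are not real companies.
SAMPLE = """cik,year,company,fraud
0000000001,2001,Example Corp A,yes
0000000002,2003,Example Corp B,no
"""

records = load_records(SAMPLE)
balance = class_balance(records)
print(balance)
```

On the full dataset, a check like `class_balance` is a quick way to confirm the equal split between fraudulent and non-fraudulent companies before training.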
Final Dataset: The final dataset is published on Kaggle.
Preprocessing steps involve text cleaning, tokenization, and transforming data into machine-readable formats, ensuring balanced and fair model training.
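The preprocessing steps named above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual pipeline: the cleaning rules, the token pattern, and the bag-of-words representation are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter

# Assumed token pattern: lowercase words, optionally with one apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def clean_text(raw: str) -> str:
    """Strip HTML-style tags, lowercase, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def tokenize(text: str) -> list[str]:
    """Split cleaned text into word tokens."""
    return TOKEN_RE.findall(text)


def build_vocabulary(token_lists, min_count=1):
    """Map each sufficiently frequent token to a column index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(kept)}


def vectorize(tokens, vocab):
    """Turn a token list into a bag-of-words count vector."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


# Toy documents for illustration only.
docs = ["<p>Revenue  INCREASED sharply.</p>", "Revenue was restated."]
token_lists = [tokenize(clean_text(d)) for d in docs]
vocab = build_vocabulary(token_lists)
vectors = [vectorize(t, vocab) for t in token_lists]
print(vocab)
print(vectors)
```

Count vectors like these suit the classical models; the transformer-based models would normally rely on their own tokenizers instead.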
The project encompasses a variety of models, including Logistic Regression, SVM, Random Forest, XGBoost, ANN, HAN, GPT-2, and FinBERT, selected for their NLP capabilities and potential in fraud detection.
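Comparing models this different is easier when every model is scored the same way. The sketch below computes accuracy, precision, recall, and F1 for binary fraud labels; the function name `binary_metrics` and the toy labels are hypothetical, and the repository's own evaluation scripts may use other metrics.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = fraud)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only; these are not project results.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Running the same function over each model's predictions on a shared held-out set gives directly comparable numbers.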
Codebase: Complete code for data extraction, preprocessing, model training, and evaluation is available in this repository.
Environment: A requirements.txt file is provided for setting up a consistent environment.
Documentation: Each script is documented with clear instructions in the README.md, guiding through environment setup, script execution, and result interpretation.
Getting Started:
- Fork the repository.
- Set up your environment with requirements.txt.
- Familiarize yourself with the code and dataset.
Contributing:
- Add or improve models, or refine preprocessing methods.
- Ensure your code is documented and aligns with the project's style.
- Submit pull requests with a detailed description of changes.
Reporting Issues:
- Use GitHub Issues for bug reports, feature requests, or discussions.
- Provide detailed bug descriptions and reproduction steps.
Community:
- Engage in discussions, share results, and ask questions.
- Adhere to community guidelines for a collaborative environment.
This project is open source and available under the MIT License.
Thanks to all contributors and community members for their valuable participation and insights in advancing AI in financial fraud detection.