This project was developed as part of the FedCSIS 2024 Data Science Challenge to predict stock trading actions (buy, sell, or hold) based on financial statement data. It utilizes Python libraries such as Pandas, NumPy, Scikit-learn, and Matplotlib, along with a Random Forest Classifier for machine learning-based predictions.
The task in this challenge is to design an accurate method for predicting a trading action (buy, sell, hold). The available training data contains 8,000 instances of fundamental financial data in tabular format (a CSV file with semicolons as separators). Each instance in the data represents an event: a financial statement announcement for one of the 300 chosen companies. It contains the company's sector, values for 58 key financial indicators, the 1-year (absolute) change for each of the 58 indicators, target class information (the column 'Class'), and risk-return performance for a period after the announcement (the column 'Perform').
Please note that the data contains two distinct types of missing values with different semantics. One corresponds to non-available/missing information and the other can be interpreted as non-applicable. In the data, one type is marked by the string "NA" and the other is an empty string (there is no value).
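As a minimal sketch of how the two missing-value markers could be kept apart while loading, the snippet below reads a tiny made-up sample (the column names and values are illustrative, not taken from the real files) with Pandas. It assumes that passing `keep_default_na=False` is acceptable for the first pass, so that neither marker is silently converted to NaN.

```python
import io

import pandas as pd

# Tiny made-up sample in the challenge's layout (semicolon-separated).
# Column names and values are illustrative only.
sample = "Group;I1;I2\nG1;1.5;NA\nG2;;3.0\n"

# keep_default_na=False stops pandas from converting both "NA" and ""
# to NaN, so the two markers remain distinguishable as strings.
raw = pd.read_csv(io.StringIO(sample), sep=";", dtype=str, keep_default_na=False)

is_na_marker = raw.eq("NA")  # cells holding the literal "NA" string
is_empty = raw.eq("")        # cells with no value at all

na_count = int(is_na_marker.sum().sum())
empty_count = int(is_empty.sum().sum())
print(na_count, empty_count)
```

The two boolean masks can then be stored as indicator features or used to apply a different fill strategy to each kind of gap before the values are converted to floats.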
Solution format: the test data, containing 2,000 instances, is also provided as a CSV file. The test file has the same format and naming scheme as the training data but it does not contain columns 'Class' and 'Perform'.
Solutions in this competition should be submitted to the online evaluation system as a text file with exactly 2,000 lines containing predictions for test instances. Each line in the submission should contain a single number from the set {1, 0, -1} that indicates the predicted trading action for the event. The ordering of predictions should be the same as the ordering of the test set.
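The submission format described above could be produced with a small helper along these lines. This is only a sketch: `write_submission`, the demo predictions, and the output file name are hypothetical and are not part of the project notebook.

```python
import tempfile
from pathlib import Path


def write_submission(predictions, path):
    """Write one prediction per line, preserving the test-set order."""
    allowed = {-1, 0, 1}
    lines = []
    for p in predictions:
        p = int(p)
        if p not in allowed:
            raise ValueError(f"invalid prediction: {p}")
        lines.append(str(p))
    Path(path).write_text("\n".join(lines) + "\n")
    return len(lines)


# Placeholder predictions; a real submission needs exactly 2,000 of them.
demo_preds = [1, 0, -1, 0]
out_path = Path(tempfile.gettempdir()) / "submission_demo.txt"
n_written = write_submission(demo_preds, out_path)
written = out_path.read_text().splitlines()
```

Checking `n_written == 2000` before uploading is a cheap guard against a truncated or misordered prediction array.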
Evaluation: the quality of submissions will be evaluated using the average error cost measure with the error cost matrix given below:
|    | -1 | 0 | 1 |
|----|----|---|---|
| -1 | 0  | 1 | 2 |
| 0  | 1  | 0 | 1 |
| 1  | 2  | 1 | 0 |
In particular, the error value is computed as: err = sum(confusion_matrix(preds, gt) * cost_matrix) / length(gt), where the multiplication is done element-wise and the sum runs over all cells of the resulting matrix.
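A minimal sketch of this measure, assuming the element-wise product is summed over all nine cells before dividing by the number of instances. The function name `average_error_cost` and the toy labels are illustrative; the confusion matrix is built by hand with NumPy so the cell layout is explicit.

```python
import numpy as np

LABELS = [-1, 0, 1]

# Cost matrix from the challenge description: the cost is 0 on the
# diagonal and grows with the distance between the two labels.
COST_MATRIX = np.array([
    [0, 1, 2],
    [1, 0, 1],
    [2, 1, 0],
])


def average_error_cost(preds, gt):
    """Average cost per instance under the challenge's cost matrix."""
    index = {label: i for i, label in enumerate(LABELS)}
    cm = np.zeros((3, 3), dtype=int)
    for p, g in zip(preds, gt):
        cm[index[p], index[g]] += 1  # rows: predicted, columns: ground truth
    return float((cm * COST_MATRIX).sum() / len(gt))


# Toy example: one correct prediction, one off by one class, two off by two.
gt = [1, 0, -1, 1]
preds = [1, 1, 1, -1]
err = average_error_cost(preds, gt)
print(err)  # (0 + 1 + 2 + 2) / 4 = 1.25
```

Because the cost matrix is symmetric, swapping the row and column roles of the confusion matrix gives the same value.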
- Column_Names_Dictionary.csv: Contains 117 rows with column codes (e.g., I1, dI58), indices, and indicator names (e.g., Return on Average Total Assets - %, TTM), providing metadata for the financial indicators.
- Group_Dictionary.csv: Lists 11 rows mapping G-codes (e.g., G1 to G11) to sector names (e.g., Financials, Health Care) and their corresponding numerical indices.
- Training_Data.csv: Includes 8,000 rows and 119 columns, with 28,933 missing values and 1,136 zero values. Columns include 'Group', 116 financial indicators (I1 to dI58), 'Class', and 'Perform'.
- Test_Data_No_Target.csv: Contains 2,000 rows and 116 columns, with 6,784 missing values and 668 zero values after preprocessing, excluding 'Class' and 'Perform'.
- Data Loading: Datasets were imported using Pandas with semicolon delimiters.
- Exploratory Data Analysis (EDA): Initial analysis revealed the structure and statistics of the data, including the distribution of the 'Class' column (mean 0.084, std 0.922, range -1 to 1).
- Handling Missing Values: Missing values (e.g., "NA" and empty strings) were replaced with NaN and filled with column means. Zero values were preserved, noting an increase in their frequency in some columns.
- Data Transformation: Object-type numerical values (with commas as decimals) were converted to float by replacing commas with dots.
- Feature Scaling: MinMaxScaler normalized feature values between 0 and 1, excluding 'Class' and 'Perform'.
- Data Splitting: The dataset was split into 80% training (X_train, y_train) and 20% validation (X_val, y_val) sets using train_test_split with a random state of 42.
- Target Classification: The 'Perform' column was categorized into classes (-1, 0, 1) using pd.cut with thresholds at -0.2 and 0.2.
- Model Training: A Random Forest Classifier with 100 estimators and random_state=42 was trained on the training set.
- Evaluation: The model achieved a weighted F1 score of 0.7495 and a validation error of 0.176875, calculated using a 3x3 confusion matrix and the provided cost matrix.
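The preprocessing and training steps listed above can be sketched roughly as follows. This is not the notebook's code: it runs on a small synthetic frame (the column names `I1` and `I2`, the random values, and the sample size are made up) and only mirrors the listed order of operations, namely decimal-comma conversion, mean imputation, min-max scaling, an 80/20 split, and a 100-tree Random Forest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for Training_Data.csv: string-typed numbers with
# decimal commas, plus both missing-value markers ("NA" and "").
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "I1": [f"{v:.3f}".replace(".", ",") for v in rng.normal(size=n)],
    "I2": [f"{v:.3f}".replace(".", ",") for v in rng.normal(size=n)],
    "Class": rng.choice([-1, 0, 1], size=n),
})
df.loc[::10, "I1"] = "NA"
df.loc[::15, "I2"] = ""

# Replace decimal commas with dots and convert to float; anything that
# cannot be parsed ("NA", "") becomes NaN.
features = df.drop(columns=["Class"]).apply(
    lambda col: pd.to_numeric(col.str.replace(",", ".", regex=False), errors="coerce")
)

# Fill the gaps with column means, then scale every feature to [0, 1].
features = features.fillna(features.mean())
X = MinMaxScaler().fit_transform(features)
y = df["Class"].to_numpy()

# 80/20 split and a Random Forest with the settings named in the list.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

val_preds = model.predict(X_val)
val_f1 = f1_score(y_val, val_preds, average="weighted")
```

The labels here are random, so the resulting score is meaningless; the sketch only shows how the pieces fit together. Fitting the scaler on the training split alone is a common alternative that keeps validation statistics out of the preprocessing.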
To use this project, follow these steps:
- Clone the repository:
git clone https://github.com/hootbu/FedCSIS2024-DataScienceChallange.git
- Navigate to the project directory:
cd FedCSIS2024-DataScienceChallange
- Open `project.ipynb` to execute the code, review the analysis, and generate predictions.