This repository contains a Python script that generates a synthetic dataset of 10,000 soldiers to model and predict the likelihood of military defection based on various individual and external factors. The dataset is designed to simulate real-world scenarios and is useful for research, analysis, and modeling purposes.
- Introduction
- Features
- Dataset Generation Process
- Defection Risk Calculation
- Usage
- Conclusion
- Next Steps
Military defection is a complex phenomenon influenced by a multitude of factors at both the individual and external levels. Understanding these factors is crucial for predicting defection risk and implementing strategies to maintain military cohesion. This project simulates a dataset that captures these factors, allowing for analysis and modeling of defection behaviors.
The dataset's features fall into two groups: individual-level and external-level factors. Each feature is either sampled from a fixed distribution or held constant, and each carries a weight that sets the direction and strength of its relationship with defection risk.
Individual-level factors:
- Opportunity Cost (opportunity_cost_norm): The potential benefits a soldier forgoes by remaining in service. Higher values increase defection risk.
- Family Military History (family_military_history_bin): Indicates whether a soldier has family members who served in the military. In this model it carries a negative weight (-0.2), so it decreases defection risk.
- Security Clearance Level (security_clearance_level_num): Access to sensitive information. Higher clearance levels increase defection risk due to the value of the information.
- Has Enemy Connections (has_enemy_connections_bin): Indicates connections with enemy forces. Presence of such connections increases defection risk.
- Morale Score (morale_score_norm): Overall satisfaction and morale. Higher morale decreases defection risk.
- Trust in Leadership Score (trust_in_leadership_score_norm): Trust in military leadership. Higher trust decreases defection risk.
External-level factors:
- Punishment Policy (punishment_policy_num): Perceived strictness of anti-defection policies. Lenient policies increase defection risk.
- Regime Type (regime_type_num): Type of political regime. Personalist regimes have higher defection risk.
- Military Structure (military_structure_num): Level of institutionalization. Institutionalized structures have higher defection risk due to formal systems.
- Defector Capture Rate (defector_capture_rate_norm): Rate at which defectors are captured. Higher rates decrease defection risk.
- Promotion Fairness Score (promotion_fairness_score_norm): Fairness of promotion systems. Fair systems decrease defection risk.
- Communication Quality Score (communication_quality_score_norm): Quality of communication within the military. Better communication decreases defection risk.
The dataset is generated through a series of steps, each responsible for creating and processing different features.
import numpy as np
import pandas as pd
# Set random seed for reproducibility
np.random.seed(0)
# Number of soldiers
n = 10000
# Initialize empty DataFrame
df = pd.DataFrame()
- Description: Represents the security clearance of each soldier.
- Values: 'low' (60%), 'medium' (30%), 'high' (10%).
df['security_clearance_level'] = np.random.choice(
    ['low', 'medium', 'high'], size=n, p=[0.6, 0.3, 0.1]
)
# Map to numerical values
security_clearance_mapping = {'low': 1, 'medium': 2, 'high': 3}
df['security_clearance_level_num'] = df['security_clearance_level'].map(security_clearance_mapping)
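As a quick sanity check on the sampling step above, the following sketch compares the observed share of each clearance level with its target probability. The names `target`, `levels`, and `observed` and the 0.02 tolerance are illustrative choices, not part of the original script.

```python
import numpy as np
import pandas as pd

np.random.seed(0)   # same seed value the script uses
n = 10000

# Target probabilities for each clearance level, as listed above
target = {'low': 0.6, 'medium': 0.3, 'high': 0.1}
levels = pd.Series(
    np.random.choice(list(target), size=n, p=list(target.values()))
)

# Observed share of each category in the sample
observed = levels.value_counts(normalize=True)

for level, p in target.items():
    # With n = 10,000 the sampling error should be well under 0.02
    assert abs(observed[level] - p) < 0.02, (level, observed[level])

print(observed.round(3))
```

The same pattern can be reused for the other categorical features, such as punishment_policy.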
- Description: Represents the morale level of each soldier.
- Values: Random float between 0 and 100.
df['morale_score'] = np.random.uniform(0, 100, size=n)
- Description: Indicates if a soldier has family members who served in the military.
- Values: Binary (0 or 1), with a 10% chance of being 1.
df['family_military_history_bin'] = np.random.binomial(1, 0.1, size=n)
- Description: Perceived strictness of anti-defection policies.
- Values: 'strict' (80%), 'lenient' (20%).
df['punishment_policy'] = np.random.choice(
    ['strict', 'lenient'], size=n, p=[0.8, 0.2]
)
# Encode to numerical values
df['punishment_policy_num'] = df['punishment_policy'].map({'strict': 0, 'lenient': 1})
- Description: Type of political regime.
- Values: 'personalist'.
df['regime_type'] = 'personalist'
# Encode to numerical values
df['regime_type_num'] = df['regime_type'].map({'personalist': 1, 'party-based': 0})
- Description: Level of trust in military leadership.
- Values: Random float between 0 and 100.
df['trust_in_leadership_score'] = np.random.uniform(0, 100, size=n)
- Description: Level of institutionalization in the military.
- Values: 'patrimonial'.
df['military_structure'] = 'patrimonial'
# Encode to numerical values
df['military_structure_num'] = df['military_structure'].map({'patrimonial': 0, 'institutionalized': 1})
- Description: Rate at which defectors are captured.
- Values: Constant at 0.8.
df['defector_capture_rate'] = 0.8
- Description: Fairness of the promotion system.
- Values: 80% between 0 and 40, 20% between 40 and 100.
df['promotion_fairness_score'] = np.where(
    np.random.rand(n) < 0.8,
    np.random.uniform(0, 40, size=n),
    np.random.uniform(40, 100, size=n)
)
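The promotion fairness score is a two-component mixture, so its expected mean can be worked out by hand: 0.8 × 20 + 0.2 × 70 = 30. The sketch below regenerates the mixture on its own and compares the sample mean with that figure. The variable names are illustrative, and because the draws are made in isolation the values will not match those in the full script.

```python
import numpy as np

np.random.seed(0)
n = 10000

# Same two-component mixture as above: 80% from [0, 40), 20% from [40, 100)
low_group = np.random.rand(n) < 0.8
scores = np.where(
    low_group,
    np.random.uniform(0, 40, size=n),
    np.random.uniform(40, 100, size=n),
)

# Expected mean: 0.8 * midpoint(0, 40) + 0.2 * midpoint(40, 100)
expected_mean = 0.8 * 20 + 0.2 * 70
share_below_40 = (scores < 40).mean()

print(f"expected mean: {expected_mean:.1f}, sample mean: {scores.mean():.1f}")
print(f"share below 40: {share_below_40:.3f}")
```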
- Description: Quality of communication within the military.
- Values: Random float between 0 and 100.
df['communication_quality_score'] = np.random.uniform(0, 100, size=n)
- Description: Potential benefits a soldier foregoes by remaining in service.
- Values: Random float between 0 and 5000, then scaled by security clearance: unchanged for 'low', multiplied by 0.5 for 'medium', and multiplied by 0.2 for 'high'.
df['opportunity_cost'] = np.random.uniform(0, 5000, size=n)
# Adjust based on security clearance
# Vectorized scaling: one multiplier per clearance level
clearance_scale = {'low': 1.0, 'medium': 0.5, 'high': 0.2}
df['opportunity_cost'] = (
    df['opportunity_cost'] * df['security_clearance_level'].map(clearance_scale)
)
# Normalize
df['opportunity_cost_norm'] = (
    df['opportunity_cost'] - df['opportunity_cost'].min()
) / (df['opportunity_cost'].max() - df['opportunity_cost'].min())
- Description: Indicates if a soldier has connections with enemy forces.
- Values: 10% 'Yes', 90% 'No', with 95% of values set to NaN to reflect missing data. Missing values are encoded as 0 in the binary column.
df['has_enemy_connections'] = np.where(
    np.random.binomial(1, 0.1, size=n) == 1, 'Yes', 'No'
)
# Set 95% to NaN
missing_mask = np.random.rand(n) < 0.95
df.loc[missing_mask, 'has_enemy_connections'] = np.nan
# Encode to binary
df['has_enemy_connections_bin'] = df['has_enemy_connections'].map({'Yes': 1, 'No': 0}).fillna(0)
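Because missing values are filled with 0, a soldier ends up flagged only when the draw is 'Yes' and the value survives the missing-data mask. The expected share of flagged soldiers is therefore 0.1 × 0.05 = 0.005, or about 50 of the 10,000. The sketch below reproduces the three steps in isolation to check this. The names `raw` and `flag` are illustrative, and the exact counts will differ from the full script because the random draws are not in the same sequence.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 10000

# Same three steps as above: draw Yes/No, blank out 95%, encode with NaN -> 0
raw = pd.Series(
    np.where(np.random.binomial(1, 0.1, size=n) == 1, 'Yes', 'No'),
    dtype=object,
)
raw[np.random.rand(n) < 0.95] = np.nan
flag = raw.map({'Yes': 1, 'No': 0}).fillna(0)

expected_rate = 0.1 * 0.05        # P(Yes) * P(value kept)
observed_rate = flag.mean()
missing_share = raw.isna().mean()

print(f"expected flagged share: {expected_rate:.3f}")
print(f"observed flagged share: {observed_rate:.4f}")
print(f"share missing before encoding: {missing_share:.3f}")
```

After encoding, a missing value and a recorded 'No' are indistinguishable in has_enemy_connections_bin.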
Each feature is assigned a weight whose sign reflects its positive or negative relationship with defection risk.
- Positive Relationship (increase defection risk):

  | Feature | Weight |
  | --- | --- |
  | opportunity_cost_norm | +0.4 |
  | security_clearance_level_num | +0.3 |
  | has_enemy_connections_bin | +0.5 |
  | punishment_policy_num | +0.2 |
  | regime_type_num | +0.3 |
  | military_structure_num | +0.2 |

- Negative Relationship (decrease defection risk):

  | Feature | Weight |
  | --- | --- |
  | family_military_history_bin | -0.2 |
  | morale_score_norm | -0.3 |
  | trust_in_leadership_score_norm | -0.3 |
  | promotion_fairness_score_norm | -0.2 |
  | communication_quality_score_norm | -0.1 |
  | defector_capture_rate_norm | -0.4 |
Features are normalized to ensure they are on the same scale (0 to 1).
# Normalize negative relationship features
df['morale_score_norm'] = df['morale_score'] / 100
df['trust_in_leadership_score_norm'] = df['trust_in_leadership_score'] / 100
df['promotion_fairness_score_norm'] = df['promotion_fairness_score'] / 100
df['communication_quality_score_norm'] = df['communication_quality_score'] / 100
df['defector_capture_rate_norm'] = df['defector_capture_rate'] # Already between 0 and 1
# Normalize security clearance level
df['security_clearance_level_norm'] = df['security_clearance_level_num'] / df['security_clearance_level_num'].max()
The defection risk score is calculated by summing the weighted contributions of each feature.
# Feature weights from the tables above, keyed by the column each one multiplies.
# The security clearance weight is applied to the normalized clearance column.
weights = {
    'opportunity_cost_norm': 0.4,
    'family_military_history_bin': -0.2,
    'security_clearance_level_norm': 0.3,
    'has_enemy_connections_bin': 0.5,
    'morale_score_norm': -0.3,
    'trust_in_leadership_score_norm': -0.3,
    'promotion_fairness_score_norm': -0.2,
    'communication_quality_score_norm': -0.1,
    'defector_capture_rate_norm': -0.4,
    'punishment_policy_num': 0.2,
    'regime_type_num': 0.3,
    'military_structure_num': 0.2,
}

# Weighted sum over all features
df['defection_risk_score'] = sum(w * df[col] for col, w in weights.items())
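Three of the features are constant across all soldiers (regime_type_num, military_structure_num, and defector_capture_rate_norm), so together they add the same fixed offset to every score. A short worked calculation, using the weights and constant values stated in this document; the dictionary names are illustrative:

```python
# Weights of the three constant features, taken from the weight tables
constant_weights = {
    'regime_type_num': 0.3,
    'military_structure_num': 0.2,
    'defector_capture_rate_norm': -0.4,
}

# Values assigned to every soldier during generation
constant_values = {
    'regime_type_num': 1,               # every soldier is 'personalist'
    'military_structure_num': 0,        # every soldier is 'patrimonial'
    'defector_capture_rate_norm': 0.8,  # constant capture rate
}

# 0.3 * 1 + 0.2 * 0 + (-0.4) * 0.8
constant_offset = sum(
    constant_weights[name] * constant_values[name] for name in constant_weights
)
print(round(constant_offset, 2))  # -0.02
```

All variation in defection_risk_score between soldiers therefore comes from the remaining nine features.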
The threshold is set to the median defection risk score.
threshold = df['defection_risk_score'].median()
Soldiers are classified as 'yes' (will defect) or 'no' (will not defect) based on whether their defection risk score exceeds the threshold.
df['will_defect'] = np.where(df['defection_risk_score'] > threshold, 'yes', 'no')
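Using the median as the cut-off means that about half of the soldiers are labeled 'yes', regardless of how the scores are distributed. The sketch below illustrates this with normally distributed stand-in scores (an assumption made only to keep the example self-contained) and contrasts it with a hypothetical stricter cut-off at the 90th percentile, which is not part of the original script.

```python
import numpy as np

np.random.seed(0)
scores = np.random.normal(size=10000)   # stand-in for defection_risk_score

# Median threshold, as in the script
median_threshold = np.median(scores)
labels_median = np.where(scores > median_threshold, 'yes', 'no')

# Hypothetical alternative: flag only the top 10% of scores
top_decile_threshold = np.quantile(scores, 0.9)
labels_top = np.where(scores > top_decile_threshold, 'yes', 'no')

share_yes_median = (labels_median == 'yes').mean()
share_yes_top = (labels_top == 'yes').mean()

print(f"median threshold: {share_yes_median:.2%} labeled 'yes'")
print(f"90th percentile threshold: {share_yes_top:.2%} labeled 'yes'")
```

A balanced label is convenient for training classifiers, but it should not be read as an estimate of how common defection is.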
- Clone the Repository:

      git clone https://github.com/yourusername/defection-risk-dataset.git
      cd defection-risk-dataset

- Install Dependencies: Ensure you have Python 3.x installed along with numpy and pandas.

      pip install numpy pandas

- Run the Script:

      python generate_dataset.py

- Explore the Dataset: The script will generate soldier_defection_dataset.csv containing the synthetic data.
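Once the CSV exists, a first look at the label balance might go as follows. This is a minimal sketch that assumes the file contains the will_defect column described above. The helper `summarize_labels` and the tiny stand-in file are hypothetical and serve only to make the example run on its own.

```python
import os
import tempfile

import pandas as pd


def summarize_labels(csv_path):
    """Return the share of each will_defect label in the CSV at csv_path."""
    data = pd.read_csv(csv_path)
    return data['will_defect'].value_counts(normalize=True)


# Tiny stand-in file so the sketch is self-contained; in practice, pass
# 'soldier_defection_dataset.csv' produced by the script instead.
demo = pd.DataFrame({
    'defection_risk_score': [-0.4, -0.1, 0.2, 0.5],
    'will_defect': ['no', 'no', 'yes', 'yes'],
})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    demo.to_csv(demo_path, index=False)
    label_shares = summarize_labels(demo_path)

print(label_shares)
```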
This project provides a synthetic dataset that models factors influencing military defection. By assigning weights based on domain knowledge to normalized features, it produces a simulated population that can be used for analysis and for prototyping predictive models.
- Model Training: Use the dataset to train machine learning models to predict defection.
- Validation: Validate the synthetic data and model predictions against real-world data if available.
- Feature Refinement: Adjust feature weights and distributions based on additional research or data.
- Temporal Analysis: Incorporate the time_of_measurement feature to study how defection risk changes over time.
- Scenario Simulation: Modify constant features like defector_capture_rate and regime_type to simulate different scenarios and their impact on defection risk.
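One point to keep in mind for scenario simulation: a constant feature shifts every score by the same amount, so if the median threshold is recomputed for each scenario, the share of 'yes' labels stays at one half. To see an effect, the threshold from the baseline scenario has to be held fixed. The sketch below shows the difference. The stand-in `varying_part`, the helper `risk_scores`, and the alternative capture rate of 0.3 are hypothetical; only the weight of -0.4 and the baseline rate of 0.8 come from this document.

```python
import numpy as np

np.random.seed(0)
n = 10000

# Stand-in for the part of the score that varies between soldiers
varying_part = np.random.normal(scale=0.2, size=n)

capture_weight = -0.4   # weight of defector_capture_rate_norm


def risk_scores(capture_rate):
    """Score under a scenario with one population-wide capture rate."""
    return varying_part + capture_weight * capture_rate


baseline = risk_scores(0.8)   # value used in the dataset
scenario = risk_scores(0.3)   # hypothetical weaker enforcement

baseline_threshold = np.median(baseline)

# Recomputing the median per scenario keeps the 'yes' share at one half
share_relabelled = (scenario > np.median(scenario)).mean()
# Holding the baseline threshold fixed exposes the shift in risk
share_fixed = (scenario > baseline_threshold).mean()

print(f"recomputed median: {share_relabelled:.2%} labeled 'yes'")
print(f"fixed baseline threshold: {share_fixed:.2%} labeled 'yes'")
```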
For questions or suggestions, please contact official.tanmay1306@gmail.com.
Disclaimer: This dataset is synthetic and created for educational and research purposes. It does not represent real individuals or events.