Model Context Protocol (MCP) - Feature Engineering Service
This project provides a FastAPI-based server to help users ingest tabular data and automatically engineer features, making data ready for machine learning model training. It leverages PySpark for scalable data processing and supports both CSV and Parquet files.
- Data Ingestion: Load CSV or Parquet files and get schema, null stats, and sample rows.
- Automated Feature Engineering: Impute missing values, encode categorical variables, extract datetime features, and assemble features for ML.
- Interactive API: Suggests required information (e.g., categorical columns) if not provided.
- Scalable: Uses PySpark for handling large datasets.
- Python 3.8+
- Java (for PySpark)
- See
requirements.txt
for Python dependencies
# Clone the repository
# cd into the project directory
python -m venv env
source env/bin/activate # On Windows: env\Scripts\activate
pip install -r requirements.txt
- Allowed file types: CSV, Parquet
- Max file size: 200MB (see
config.py
) - Data directory: Set
DATA_DIR
environment variable if needed
uvicorn main:app --reload --host 0.0.0.0 --port 8000
POST /ingest
Request Body:
{
"file_path": "path/to/data.csv",
"file_type": "csv",
"delimiter": ","
}
Response:
{
"columns": ["col1", "col2", ...],
"dtypes": {"col1": "int", "col2": "string", ...},
"null_stats": {"col1": 0, "col2": 5, ...},
"sample_rows": [ {"col1": 1, "col2": "A"}, ... ]
}
POST /feature-engineer
Request Body:
{
"target_column": "label",
"categorical_columns": ["cat1", "cat2"],
"datetime_columns": ["date_col"],
"feature_methods": ["auto"],
"custom_params": {}
}
Response:
{
"engineered_columns": ["col1", "cat1_oh", ...],
"transformations": {"cat1": "onehot", ...},
"feature_stats": {"col1": {"mean": 5.2, "count": 100}, ...},
"message": "Auto feature engineering complete"
}
If required info is missing, the API will suggest or request it interactively.
import requests
# Ingest data
resp = requests.post("http://localhost:8000/ingest", json={
"file_path": "data.csv",
"file_type": "csv",
"delimiter": ","
})
print(resp.json())
# Feature engineering
resp = requests.post("http://localhost:8000/feature-engineer", json={
"categorical_columns": ["cat1", "cat2"]
})
print(resp.json())
- Logs are printed to the console with timestamps and log levels (see
logger.py
).
MIT License. See LICENSE.