Large Language Models (LLMs) have demonstrated significant potential for automating data engineering tasks on tabular data, giving enterprises a valuable opportunity to reduce the high costs associated with manual data handling. However, the enterprise domain introduces unique challenges that existing LLM-based approaches for data engineering often overlook, such as large table sizes, more complex tasks, and the need for internal knowledge. To bridge these gaps, we identify key enterprise-specific challenges related to data, tasks, and background knowledge and conduct a comprehensive study of their impact on recent LLMs for data engineering. Our analysis reveals that LLMs face substantial limitations in real-world enterprise scenarios, resulting in significant accuracy drops. Our findings contribute to a systematic understanding of LLMs for enterprise data engineering to support their adoption in industry.
Please find our prompt templates and example prompts in `PROMPTS.md`.
- Experiment implemented in `experiments/enterprise_challenges_cta`, task implemented in `tasks/column_type_annotation`.
- Experiments implemented in `experiments/enterprise_data_headers_types_cta`, task implemented in `tasks/column_type_annotation`.
- Experiment implemented in `experiments/enterprise_data_sparsity_width_cta`, task implemented in `tasks/column_type_annotation`.
- Experiments implemented in `experiments/enterprise_tasks_pay_to_inv`, task implemented in `tasks/entity_matching`.
- Experiments implemented in `experiments/enterprise_tasks_compound`, task implemented in `tasks/compound_task`.
- Experiment implemented in `experiments/enterprise_knowledge_text2signal`, task implemented in `tasks/text2signal`.
- Experiment implemented in `experiments/enterprise_knowledge_schema_prediction`, task implemented in `tasks/schema_prediction`.
- Experiment implemented in `experiments/costs_imdb_wikipedia_enterprisetables`.
Make sure you have Python 3.13 installed.
Create a virtual environment, activate it, install the dependencies, and add the project to the Python path:
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
export PYTHONPATH=${PYTHONPATH}:./
```
Run `bash test.sh` to run the test suite and `bash reproduce.sh` to run the experiments.
To execute API requests, you must also set up access to OpenAI, Anthropic, Ollama, Hugging Face, and SAP AI Core.
To reproduce our exact results using the same model responses, you must place the cached requests and responses in `data/openai_cache`, `data/anthropic_cache`, `data/ollama_cache`, and `data/aicore_cache`.
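If the cache directories do not exist yet, creating them before copying the cached files in should be safe. This is a minimal sketch, assuming the caches are provided as directories of per-request files:

```bash
# Create the cache directories (assumed layout) before copying the cached requests/responses into them.
mkdir -p data/openai_cache data/anthropic_cache data/ollama_cache data/aicore_cache
```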
You must store your API keys in environment variables:
```bash
export OPENAI_API_KEY="<your-key>"
export ANTHROPIC_API_KEY="<your-key>"
export HF_TOKEN="<your-key>"
```
Use the Ollama Docker container:

```bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull llama3.1:70b-instruct-fp16
```
Cleanup:
```bash
docker ps
docker stop <container-id>
docker container rm <container-id>
docker volume rm ollama
```
SAP-internal code has been redacted.
The repository is structured into tasks, config, data, experiments, and library code.
Tasks like column type annotation and entity matching are implemented in `tasks/<task-name>`. Each task can have multiple datasets, like `tasks/column_type_annotation/sportstables`.
Each task is implemented as a pipeline of Python scripts (see the sketch after this list):

- Download the original dataset (specific to each task and dataset)
- Preprocess to generate evaluation instances (specific to each task and dataset)
- Prepare API requests (specific to each task)
- Execute API requests (same for all tasks)
- Parse API responses (specific to each task)
- Evaluate the predictions (specific to each task)
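For orientation only, running the pipeline for one task and dataset might look roughly as follows. The stage script names are hypothetical placeholders; the actual scripts live under `tasks/<task-name>/`:

```bash
# Hypothetical stage scripts (names are placeholders, not the actual file names).
python tasks/column_type_annotation/download.py    # download the original dataset
python tasks/column_type_annotation/preprocess.py  # generate evaluation instances
python tasks/column_type_annotation/prepare.py     # prepare API requests
python tasks/column_type_annotation/execute.py     # execute API requests
python tasks/column_type_annotation/parse.py       # parse API responses
python tasks/column_type_annotation/evaluate.py    # evaluate the predictions
```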
Configuration for tasks and datasets uses Hydra and is stored in `config`.
The prompt templates for each task are stored in `config/<task-name>/config.yaml`.
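Because the configuration is managed with Hydra, individual values can typically be overridden on the command line when a script is invoked. The script name and config keys below are assumptions that only illustrate the Hydra override syntax:

```bash
# Hypothetical Hydra command-line overrides (script name and keys are assumptions).
python tasks/column_type_annotation/prepare.py dataset=sportstables model=gpt-4o
```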
Data is stored in `data/<task-name>/<dataset-name>`.
For each dataset, the original download is placed in `data/<task-name>/<dataset-name>/download`.
The experiment runs are stored in `data/<task-name>/<dataset-name>/<experiments>/<experiment-name>`. Each experiment consists of:

- `instances` as sequentially numbered directories (one for each instance)
- `requests` as sequentially numbered JSON files
- `responses` as sequentially numbered JSON files
- `predictions` as sequentially numbered directories (one for each prediction)
- `results`
Experiment implementations and their results are stored in `experiments/<experiment-name>`. Each experiment typically conducts a sweep of experiment runs for a task (implemented in `run.sh`) before gathering their results (implemented in `gather.py`) and plotting them (implemented in `plot.py`).
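To run a single experiment end to end, a sketch like the following should work, assuming the experiment directory contains the `run.sh`, `gather.py`, and `plot.py` described above (the experiment name is just one of those listed earlier):

```bash
# Sweep, gather, and plot one experiment (experiment name is an example).
bash experiments/enterprise_challenges_cta/run.sh
python experiments/enterprise_challenges_cta/gather.py
python experiments/enterprise_challenges_cta/plot.py
```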
Library code is implemented in `llms4de`, is mostly functional, and has pytest tests.
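As an alternative to `bash test.sh`, invoking pytest directly from the repository root should also work, assuming standard pytest test discovery:

```bash
# Run the pytest tests directly; test.sh presumably wraps a similar invocation.
python -m pytest
```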