Version: 1.2 | Last Updated: 2025-01-15
A comprehensive dbt project for transforming and analyzing LLM evaluation data from an automated GCS to BigQuery ELT pipeline. This project processes conversation turns, retrieved documents, and node evaluations to provide insights into LLM performance and capability assessment.
This project is part of an automated ELT (Extract, Load, Transform) pipeline that processes LLM evaluation data:
GCS (JSONL files) → BigQuery Staging → dbt Transformations → Analytics Tables → Hex Visualization
- Extract & Load (EL): Cloud Function loads JSONL files from GCS into BigQuery staging table
- Transform (T): dbt models transform raw data into production-ready analytics tables
- Visualization: Hex dashboards provide real-time insights into LLM performance
- Cloud Storage: `eval_results/` (raw) and `processed_eval_results/` (archive)
- BigQuery: Staging table `daily_load` in the `staging_eval_results_raw` dataset
- dbt: Transformations run via a Cloud Run Job with Docker containerization
- Scheduling: Cloud Scheduler triggers the pipeline every 6 hours
Source table: `staging_eval_results_raw.daily_load`
- Raw JSONL data loaded from GCS with auto-detected schema
- Contains conversation turns with nested arrays for retrieved documents and evaluations
- 3-day table expiration for cost optimization
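The nested layout of this table drives the models below: each staging row carries the turn-level fields plus arrays of retrieved documents and node evaluations. As a rough illustration, a query like the following can be used to inspect the raw rows; the array column names (`retrieved_docs`, `evaluations`) are assumptions inferred from the models described below, not confirmed schema.

```sql
-- Exploratory peek at the raw staging table (hypothetical column names:
-- retrieved_docs and evaluations are assumed to be nested arrays).
select
    run_id,
    thread_id,
    turn_index,
    array_length(retrieved_docs) as n_retrieved_docs,
    array_length(evaluations)    as n_node_evals
from `staging_eval_results_raw.daily_load`
where date(timestamp_start) = current_date()
limit 10;
```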
All tables are materialized as incremental models with daily partitioning on `timestamp_start`.
`conversation_turns`
Purpose: One row per conversation turn with core metrics and performance data.
Key Fields:
- `run_id`, `thread_id`, `turn_index` (unique key)
- `graph_version` - Version of the graph being evaluated
- `timestamp_start`, `timestamp_end` - Timing information
- `query`, `response` - User input and LLM output
- `total_latency_ms`, `graph_latency_ms`, `evaluation_latency_ms` - Performance metrics
- `time_to_first_token_ms` - Response speed indicator
- Token usage for both graph and evaluation phases
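As a rough sketch of how an incremental model like `conversation_turns.sql` can be structured: the column list mirrors the fields above, but the source reference, unique-key list, and partition settings here are illustrative assumptions, not the project's exact code.

```sql
-- Hypothetical sketch of models/conversation_turns.sql
{{
    config(
        materialized='incremental',
        unique_key=['run_id', 'thread_id', 'turn_index'],
        partition_by={'field': 'timestamp_start', 'data_type': 'timestamp', 'granularity': 'day'},
        on_schema_change='append_new_columns'
    )
}}

select
    run_id,
    thread_id,
    turn_index,
    graph_version,
    timestamp_start,
    timestamp_end,
    query,
    response,
    total_latency_ms,
    graph_latency_ms,
    evaluation_latency_ms,
    time_to_first_token_ms
from {{ source('staging_eval_results_raw', 'daily_load') }}

{% if is_incremental() %}
  -- only pick up rows newer than what the target table already contains
  where timestamp_start > (select max(timestamp_start) from {{ this }})
{% endif %}
```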
`retrieved_docs`
Purpose: One row per retrieved document, enabling analysis of retrieval quality.
Key Fields:
- `run_id`, `thread_id`, `turn_index` (unique key)
- `doc_content` - Retrieved document text
- `doc_score` - Relevance score from the retrieval system
- `doc_metadata` - Additional document metadata
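Because the staging rows hold retrieved documents as a nested array, this model has to flatten them. A minimal sketch, assuming the array is named `retrieved_docs` with struct fields `content`, `score`, and `metadata` (hypothetical names):

```sql
-- Hypothetical flattening of the nested document array into one row per document.
select
    t.run_id,
    t.thread_id,
    t.turn_index,
    doc.content  as doc_content,   -- assumed struct field names
    doc.score    as doc_score,
    doc.metadata as doc_metadata
from {{ source('staging_eval_results_raw', 'daily_load') }} as t,
    unnest(t.retrieved_docs) as doc
```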
`node_evals`
Purpose: One row per node evaluation, tracking LLM performance across multiple dimensions.
Key Fields:
- `run_id`, `thread_id`, `turn_index`, `node_name` (unique key)
- `evaluator_name` - Name of the evaluation component
- `overall_success` - Binary success indicator
- `classification` - Categorization of the response
- Capability Metrics:
  - `persona_adherence` - Adherence to specified persona
  - `follows_rules` - Compliance with system rules
  - `format_valid` - Output format correctness
  - `faithfulness` - Accuracy to source material
  - `answer_relevance` - Relevance to user query
  - `handles_irrelevance` - Ability to handle irrelevant inputs
  - `context_relevance` - Context appropriateness
  - `includes_key_info` - Inclusion of essential information
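This model follows the same flattening pattern as `retrieved_docs`, but over the nested evaluations array. A minimal sketch, assuming the array is named `evaluations` and carries the capability fields as struct members (hypothetical names):

```sql
-- Hypothetical flattening of the nested evaluations array into one row per node evaluation.
select
    t.run_id,
    t.thread_id,
    t.turn_index,
    ev.node_name,
    ev.evaluator_name,
    ev.overall_success,
    ev.classification,
    ev.persona_adherence,
    ev.follows_rules,
    ev.format_valid,
    ev.faithfulness,
    ev.answer_relevance,
    ev.handles_irrelevance,
    ev.context_relevance,
    ev.includes_key_info
from {{ source('staging_eval_results_raw', 'daily_load') }} as t,
    unnest(t.evaluations) as ev
```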
The evaluation system implements a multi-layered capability assessment funnel:
- Base Functionality (`overall_success`)
  - Binary pass/fail at the node level
  - Fundamental capability validation
- Behavioral Compliance
  - `persona_adherence` - Character consistency
  - `follows_rules` - System rule compliance
  - `format_valid` - Output structure validation
- Content Quality
  - `faithfulness` - Source material accuracy
  - `answer_relevance` - Query response relevance
  - `context_relevance` - Situational appropriateness
- Advanced Capabilities
  - `handles_irrelevance` - Noise filtering ability
  - `includes_key_info` - Essential information retention
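To make the funnel concrete, a roll-up like the one below can show where capabilities drop off per node. This is a hypothetical query over the `node_evals` model, written as dbt-style SQL and assuming the capability columns are booleans (or 0/1 flags); one representative column is used per layer.

```sql
-- Hypothetical capability-funnel roll-up per node.
select
    node_name,
    count(*)                                  as evaluations,
    avg(cast(overall_success     as int64))   as base_functionality_rate,
    avg(cast(follows_rules       as int64))   as behavioral_compliance_rate,
    avg(cast(faithfulness        as int64))   as content_quality_rate,
    avg(cast(handles_irrelevance as int64))   as advanced_capability_rate
from {{ ref('node_evals') }}
group by node_name
order by evaluations desc
```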
- Performance Metrics: Latency, token usage, response times
- Quality Metrics: Success rates, adherence scores, relevance measures
- Operational Metrics: Graph versions, evaluation timestamps, classification data
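For example, a daily performance roll-up over `conversation_turns` might look like the following; the query is illustrative (column names follow the field list above) rather than an exact dashboard query.

```sql
-- Hypothetical daily latency and volume roll-up by graph version.
select
    date(timestamp_start)                                as eval_date,
    graph_version,
    count(*)                                             as turns,
    avg(total_latency_ms)                                as avg_total_latency_ms,
    approx_quantiles(total_latency_ms, 100)[offset(95)]  as p95_total_latency_ms,
    avg(time_to_first_token_ms)                          as avg_time_to_first_token_ms
from {{ ref('conversation_turns') }}
group by eval_date, graph_version
order by eval_date desc, graph_version
```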
The transformed data is visualized in Hex dashboards to provide:
- Real-time Performance Monitoring: Latency trends, success rates, token usage
- Capability Assessment: Funnel analysis showing where LLM capabilities succeed/fail
- Document Retrieval Analysis: Quality of retrieved context and relevance scores
- Version Comparison: Performance across different graph versions
- Operational Insights: System health, evaluation coverage, data quality metrics
```
chatraghu_dbt/
├── models/
│   ├── conversation_turns.sql   # Main fact table
│   ├── retrieved_docs.sql       # Document retrieval analysis
│   ├── node_evals.sql           # LLM evaluation metrics
│   └── sources.yml              # Source table definitions
├── Dockerfile                   # Production container
├── docker-compose.yml           # Local development
├── dbt_project.yml              # Project configuration
├── .dockerignore                # Docker build exclusions
```
All models use incremental materialization with:
- Daily partitioning on `timestamp_start`
- Unique keys for deduplication
- Schema change handling (`append_new_columns`)
Note: This project is designed to fit within GCP's free tier for low-to-moderate usage, ensuring cost-effective operation while maintaining production-grade reliability.