Comparative Analysis of Large Language Models in SQL and .NET Code Generation Tasks

Abstract

This study systematically evaluates the performance of 14 large language models (LLMs) in generating production-ready SQL stored procedures and corresponding .NET full-stack application components. The benchmark reveals significant variations in code quality, contextual handling, and adherence to explicit technical requirements. Notably, the models demonstrated divergent approaches to critical implementation challenges, including input parameter validation, error handling mechanisms, SQL syntax correctness, and code organization. The most consistent pattern across models was inadequate handling of SQL reserved words without proper bracketing—only the Qwen models maintained this best practice for Microsoft SQL Server environments. Context limitations emerged as a critical constraint, with several models (particularly the 15B-thinker model) experiencing output truncation due to exceeding context window sizes during complex code generation tasks.

The analysis identified two models with particularly strong performance characteristics: the OpenAI gpt-oss-20b model demonstrated exceptional output quality through comprehensive documentation, appropriate SQL syntax (using square brackets for column names), and a well-structured folder hierarchy for .NET implementation. The Qwen family (particularly qwen3-4b-thinking-2507) emerged as the most comprehensive generator, producing well-organized code with proper error handling and Bootstrap integration for Razor pages. Conversely, several models exhibited critical failures in production readiness, including insufficient input validation, inadequate error handling, and incomplete implementation of requested components—particularly the CodeGemma-7b model which hallucinated Python solutions for SQL tasks and crashed mid-generation.

These findings underscore the significant challenges in deploying LLMs for production code generation. The results indicate that while technical capability can be strong, successful implementation requires addressing context limitations, maintaining strict adherence to database-specific syntax conventions, and implementing robust error handling mechanisms. Future work should prioritize model fine-tuning specifically for database operations and .NET ecosystem requirements, with particular attention to the generation of production-grade code that avoids common pitfalls in parameter validation and error management. The benchmark also highlights that context window management remains a critical factor in successful complex code generation tasks.

Author: Eduardo Nicacio
Language: English
Keywords: Local LLM, Local LLM Benchmark, Code generation, T-SQL, C-sharp

Note: abstract generated by qwen/qwen3-4b-thinking-2507 via LM Studio RAG capabilities, i.e., extracting information from the existing Markdown files within this repository.

Methodology

Standardized Prompt Design

Create one comprehensive prompt that explicitly defines:

All technical requirements (SQL syntax, validation rules, error handling)
Expected output format and structure
Specific constraints (e.g., "use square brackets for column names")
Production readiness requirements (error logging, input validation)

Systematic Evaluation Framework

For each model and each coding task (T-SQL and C#), implement these checks:

Evaluation Area	Specific Checks	Pass/Fail Criteria
SQL Syntax	Proper bracketing of reserved words	Must use [brackets] for all reserved words (only Qwen models maintained this best practice)
Error Handling	Error 50001-50004 implementation	Must raise specific errors as defined in requirements
Input Validation	Parameter validation as specified	All null parameters must trigger error 50001
Production Readiness	Error logging to dbo.DbError table	Must implement TRY...CATCH with proper logging
Context Handling	Full output without truncation	No context window overflow issues
Code Quality	Code organization, documentation	Must follow .NET best practices

One-Pass Implementation Protocol

For each model, run the exact same prompt for both T-SQL generation and C# code generation (no prompt variations).
Capture all generated output in a standardized format.
Apply automated checks for:
- SQL syntax validation using SQL linters
- C# code correctness via .NET analyzers
- Error handling implementation
- Documentation completeness
- Document any output truncation or context window issues

Critical Quality Metrics

Tracked metrics per model:

Success Rate: Percentage of requirements met
Error Rate: Number of errors in generated code
Context Limitation: Percentage of outputs truncated due to context window
Production Readiness Score: 0-10 scale based on production quality
Time to Generate: Average time for one complete generation

Analysis Focus Areas

This methodology specifically tracks:

How well models handle SQL reserved words (critical finding in the document)
Performance with complex requirements that test context limits
Consistency across different code components (SQL vs. C#)
Handling of edge cases like null values and date ranges

Platform Used

The study was conducted on a high-performance Windows system, consisting in:

Motherboard: Gigabyte Aorus B550 Elite AX V2
CPU: AMD Ryzen 9 5900X (12c/24t) @4.7GHz
RAM: G.Skill F4-3200C16D-32GVK
Storage: Crucial P5 Plus 1TB Gen4 NVMe M.2 SSD
Graphics card: Sapphire PURE AMD Radeon™ RX 9070 XT
Operating System: Microsoft Windows 11 Professional 24-H2 (OS build 26200.6725)
LLM Platform: LM Studio 0.3.28
Library: llama.cpp Vulkan Windows v1.52.1

This configuration represents a typical high-end local development environment for running LLMs.

LLM Models Chosen

The benchmark evaluated 14 models across different categories:

Specialized Coding Models:
General Purpose Models:
Specialized Variants:
- qwen/qwen3-4b-thinking-2507
- mistralai/devstral-small-2507

This methodology directly addresses one's one-pass constraint while ensuring meaningful results. By standardizing the prompt and evaluation criteria, one can fairly compare models without introducing bias from prompt variations. The focus on specific failure points (like the SQL reserved words issue highlighted in the document) ensures one captures the most critical weaknesses that affect real-world deployment.

The most valuable insight from this approach would be identifying which models consistently handle complex code generation without context truncation while maintaining proper SQL syntax and error handling - the key finding from the document that only the Cohere, Mistral-AI, OpenAI and Qwen models maintained proper bracketing for SQL reserved words. Qwen models have taken a step beyond by including proper bracketing for all the columns of the given table.

This methodology provides actionable insights for developers selecting the right local LLM for production code generation tasks without compromising quality.

Normalized prompt for T-SQL code generation (MS SQL Server 2022)

Below is the normalized prompt used for all the models (including Cohere Command):

Task:

Generate SQL stored procedures for the Activity table based on the following requirements. Use T-SQL syntax and ensure the code is production-ready.

Table Definition:

CREATE TABLE [dbo].[Activity](
    [ActivityId] [uniqueidentifier] NOT NULL,
    [ProjectId] [uniqueidentifier] NOT NULL,
    [ProjectMemberId] [uniqueidentifier] NOT NULL,
    [Name] [nvarchar](128) NOT NULL,
    [Description] [nvarchar](4000) NOT NULL,
    [StartDate] [date] NULL,
    [TargetDate] [date] NULL,
    [EndDate] [date] NULL,
    [ProgressStatus] [tinyint] NULL,
    [ActivityPoints] [smallint] NULL,
    [Priority] [tinyint] NULL,
    [Risk] [tinyint] NULL,
    [Tags] [nvarchar](200) NULL,
    [ActiveFlag] [tinyint] NOT NULL,
    [SystemDeleteFlag] [char](1) NOT NULL,
    [CreatedDateTime] [datetime2](7) NOT NULL,
    [CreatedByUser] [nvarchar](100) NOT NULL,
    [CreatedByProgram] [nvarchar](100) NOT NULL,
    [UpdatedDateTime] [datetime2](7) NULL,
    [UpdatedByUser] [nvarchar](100) NULL,
    [UpdatedByProgram] [nvarchar](100) NULL,
    [SystemTimestamp] [timestamp] NOT NULL,
    CONSTRAINT [PK_Activity_ActivityId] PRIMARY KEY CLUSTERED ([ActivityId] ASC)
) ON [PRIMARY];
GO

Stored Procedures to Create:

usp_ActivityInsert:

Input parameters: All columns except UpdatedDateTime, UpdatedByUser, UpdatedByProgram, and SystemTimestamp.
Explicitly include CreatedDateTime, CreatedByUser, and CreatedByProgram as input parameters.
Default nullable columns to NULL.

usp_ActivityDelete:

Input parameters: ActivityId, UpdatedDateTime, UpdatedByUser, UpdatedByProgram, and SystemTimestamp.
Soft-delete by setting SystemDeleteFlag = 'Y'.
Default UpdatedDateTime, UpdatedByUser, and UpdatedByProgram to SYSUTCDATETIME(), SYSTEM_USER, and APP_NAME() if null.
Use SystemTimestamp for optimistic locking.

usp_ActivityUpdate:

Input parameters: All columns except CreatedDateTime, CreatedByUser, and CreatedByProgram.
Default UpdatedDateTime, UpdatedByUser, and UpdatedByProgram to SYSUTCDATETIME(), SYSTEM_USER, and APP_NAME() if null.
Use SystemTimestamp for optimistic locking.

usp_ActivityRetrieve:

Input parameters: All columns (defaulted to NULL), except UpdatedDateTime, UpdatedByUser, and UpdatedByProgram.
Default ActiveFlag to 1 and SystemDeleteFlag to 'N'.
Use wildcard search for varchar and nvarchar columns:

(@Parameter IS NULL OR CHARINDEX(@Parameter, [Column], 0) > 0)
Use date range search for date columns:

(@DateTimeParameter IS NULL OR DateTimeColumn >= @DateTimeParameter AND DateTimeColumn < DATEADD(day, 1, @DateTimeParameter))

usp_ActivityRetrieveForList:

Retrieve ActivityId and Name for active, non-deleted records (ActiveFlag = 1 and SystemDeleteFlag <> 'Y').

General Requirements for All Stored Procedures:

Validate input parameters for type, size, and nullability.
Raise error 50001 for null parameters.
Raise error 50002 for string parameters exceeding column length.
Raise error 50003 for invalid ActiveFlag (must be 0 or 1) or SystemDeleteFlag (must be 'N' or 'Y').
Wrap SELECT, INSERT, and UPDATE statements in a TRY...CATCH block.
Log errors to dbo.DbError table in the CATCH block.
Raise error 50000 with message: "Error occurred during [Operation] operation."
For optimistic lock violations, raise error 50004 with message:

"Operation failed because another user has updated or deleted this Activity. Your changes have been lost. Please review their changes before trying again."
Enclose each stored procedure in a separate SQL code block.

Output Format:

Provide each stored procedure in a separate code block, fully functional and ready to execute.

Normalized prompt for C# code generation

Below is the normalized prompt used for all the models, including Cohere Command:

Prompt:

Based on the following SQL table definition, generate the necessary C# components for a full-stack .NET application. The components should include:

C# Model: A class representing the Activity table, including all columns as properties with appropriate data types.
C# DTOs: Data Transfer Objects (DTOs) for Create, Update, and Read operations, mapping to the relevant fields in the Activity table.
C# View Models: View models for Create, Update, and List operations, tailored for use in Razor Pages.
Entity Framework Core Repository: A generic repository with CRUD methods (Get, Add, Update, Delete) and an Activity-specific repository implementation.
Minimal API: A Minimal API controller with endpoints for CRUD operations (GET, POST, PUT, DELETE) on the Activity table.
ASP.NET Razor Pages: Separate Razor Pages for Create, Edit, Details, Delete, and Index operations, including the necessary @page models and HTML forms.
Unit Tests (NUnit): Tests for models, DTOs, and view models to validate property assignments and data annotations.
Integration Tests (Moq): Tests for EF Core repository methods, Minimal API endpoints, and Razor Pages, using Moq to simulate database interactions.

Ensure all code follows .NET best practices, includes proper validation, and is ready for production use. Provide each component in separate code blocks with clear comments."

Table Definition for Reference:

CREATE TABLE [dbo].[Activity](
    [ActivityId] [uniqueidentifier] NOT NULL,
    [ProjectId] [uniqueidentifier] NOT NULL,
    [ProjectMemberId] [uniqueidentifier] NOT NULL,
    [Name] [nvarchar](128) NOT NULL,
    [Description] [nvarchar](4000) NOT NULL,
    [StartDate] [date] NULL,
    [TargetDate] [date] NULL,
    [EndDate] [date] NULL,
    [ProgressStatus] [tinyint] NULL,
    [ActivityPoints] [smallint] NULL,
    [Priority] [tinyint] NULL,
    [Risk] [tinyint] NULL,
    [Tags] [nvarchar](200) NULL,
    [ActiveFlag] [tinyint] NOT NULL,
    [SystemDeleteFlag] [char](1) NOT NULL,
    [CreatedDateTime] [datetime2](7) NOT NULL,
    [CreatedByUser] [nvarchar](100) NOT NULL,
    [CreatedByProgram] [nvarchar](100) NOT NULL,
    [UpdatedDateTime] [datetime2](7) NULL,
    [UpdatedByUser] [nvarchar](100) NULL,
    [UpdatedByProgram] [nvarchar](100) NULL,
    [SystemTimestamp] [timestamp] NOT NULL,
    CONSTRAINT [PK_Activity_ActivityId] PRIMARY KEY CLUSTERED ([ActivityId] ASC)
) ON [PRIMARY];
GO

Expected Output Structure:

C# Model
C# DTOs (CreateDto, UpdateDto, ReadDto)
C# View Models (CreateViewModel, UpdateViewModel, ListViewModel)
EF Core Repository (Generic Repository, Activity Repository)
Minimal API (CRUD Endpoints)
Razor Pages (Create.cshtml, Edit.cshtml, Details.cshtml, Delete.cshtml, Index.cshtml, and their respective PageModels)
Unit Tests (NUnit)
Integration Tests (Moq)

Provide each component in a well-organized, production-ready format.

Results

[placeholder]

T-SQL Generation

[placeholder]

C# Models

[placeholder]

C# DTOs

[placeholder]

C# ViewModels

[placeholder]

C# Repositories

[placeholder]

Asp.Net Core Minimal APIs

[placeholder]

Asp.Net Core Razor Page(s)

[placeholder]

Unit Tests results

[placeholder]

Integration Tests results

[placeholder]

Comments

When it comes to run LLMs locally, choosing the right model for a given task is probably one of the most important decisions one can make. While some models are simply brilliant at coding tasks, other can't complete the tasks even when explicitly instructed about what to do.

[placeholder]

With all that said, choosing the right model for the tasks above allowed me to refactor an old Asp.Net project of mine as follows:

Data layer: 20h (SQL Server 2022 db with 67 tables and 146 stored procedures, totalling 17,000 lines of code).
C# Models: TBD
C# DTOs: TBD
C# ViewModels: TBD
C# Repositories: TBD
Asp.Net Core Minimal Api: TBD
Asp.Net Code Razor pages: TBD
Unit tests: TBD
Integration tests: TBD

Useful links

Found these findings useful? Leave a ⭐.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
apriel-1.5-15b-thinker		apriel-1.5-15b-thinker
codegemma-7b-GGUF		codegemma-7b-GGUF
codestral-22b-v0.1		codestral-22b-v0.1
command-a-03-2025		command-a-03-2025
deepseek-coder-v2-lite-instruct		deepseek-coder-v2-lite-instruct
devstral-small-2507		devstral-small-2507
gemma-3-12b		gemma-3-12b
gpt-oss-20b		gpt-oss-20b
granite-4-h-tiny		granite-4-h-tiny
llama-3.1-8b-instruct		llama-3.1-8b-instruct
phi-4-reasoning-plus		phi-4-reasoning-plus
qwen2.5-coder-14b		qwen2.5-coder-14b
qwen3-4b-2507		qwen3-4b-2507
qwen3-4b-thinking-2507		qwen3-4b-thinking-2507
LICENSE.txt		LICENSE.txt
benchmark_summary.md		benchmark_summary.md
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Comparative Analysis of Large Language Models in SQL and .NET Code Generation Tasks

Table of Contents

Abstract

Methodology

Standardized Prompt Design

Systematic Evaluation Framework

One-Pass Implementation Protocol

Critical Quality Metrics

Analysis Focus Areas

Platform Used

LLM Models Chosen

Normalized prompt for T-SQL code generation (MS SQL Server 2022)

Normalized prompt for C# code generation

Results

T-SQL Generation

C# Models

C# DTOs

C# ViewModels

C# Repositories

Asp.Net Core Minimal APIs

Asp.Net Core Razor Page(s)

Unit Tests results

Integration Tests results

Comments

Useful links

About

Uh oh!

Releases

Packages

Languages

License

EduardoNicacio/LLMBenchmark

Folders and files

Latest commit

History

Repository files navigation

Comparative Analysis of Large Language Models in SQL and .NET Code Generation Tasks

Table of Contents

Abstract

Methodology

Standardized Prompt Design

Systematic Evaluation Framework

One-Pass Implementation Protocol

Critical Quality Metrics

Analysis Focus Areas

Platform Used

LLM Models Chosen

Normalized prompt for T-SQL code generation (MS SQL Server 2022)

Normalized prompt for C# code generation

Results

T-SQL Generation

C# Models

C# DTOs

C# ViewModels

C# Repositories

Asp.Net Core Minimal APIs

Asp.Net Core Razor Page(s)

Unit Tests results

Integration Tests results

Comments

Useful links

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages