Skip to content

EduardoNicacio/LLMBenchmark

Repository files navigation

Comparative Analysis of Large Language Models in SQL and .NET Code Generation Tasks

Table of Contents

Abstract

This study systematically evaluates the performance of 14 large language models (LLMs) in generating production-ready SQL stored procedures and corresponding .NET full-stack application components. The benchmark reveals significant variations in code quality, contextual handling, and adherence to explicit technical requirements. Notably, the models demonstrated divergent approaches to critical implementation challenges, including input parameter validation, error handling mechanisms, SQL syntax correctness, and code organization. The most consistent pattern across models was inadequate handling of SQL reserved words without proper bracketing—only the Qwen models maintained this best practice for Microsoft SQL Server environments. Context limitations emerged as a critical constraint, with several models (particularly the 15B-thinker model) experiencing output truncation due to exceeding context window sizes during complex code generation tasks.

The analysis identified two models with particularly strong performance characteristics: the OpenAI gpt-oss-20b model demonstrated exceptional output quality through comprehensive documentation, appropriate SQL syntax (using square brackets for column names), and a well-structured folder hierarchy for .NET implementation. The Qwen family (particularly qwen3-4b-thinking-2507) emerged as the most comprehensive generator, producing well-organized code with proper error handling and Bootstrap integration for Razor pages. Conversely, several models exhibited critical failures in production readiness, including insufficient input validation, inadequate error handling, and incomplete implementation of requested components—particularly the CodeGemma-7b model which hallucinated Python solutions for SQL tasks and crashed mid-generation.

These findings underscore the significant challenges in deploying LLMs for production code generation. The results indicate that while technical capability can be strong, successful implementation requires addressing context limitations, maintaining strict adherence to database-specific syntax conventions, and implementing robust error handling mechanisms. Future work should prioritize model fine-tuning specifically for database operations and .NET ecosystem requirements, with particular attention to the generation of production-grade code that avoids common pitfalls in parameter validation and error management. The benchmark also highlights that context window management remains a critical factor in successful complex code generation tasks.

Author: Eduardo Nicacio
Language: English
Keywords: Local LLM, Local LLM Benchmark, Code generation, T-SQL, C-sharp

Note: abstract generated by qwen/qwen3-4b-thinking-2507 via LM Studio RAG capabilities, i.e., extracting information from the existing Markdown files within this repository.

Methodology

Standardized Prompt Design

Create one comprehensive prompt that explicitly defines:

  • All technical requirements (SQL syntax, validation rules, error handling)
  • Expected output format and structure
  • Specific constraints (e.g., "use square brackets for column names")
  • Production readiness requirements (error logging, input validation)

Systematic Evaluation Framework

For each model and each coding task (T-SQL and C#), implement these checks:

Evaluation Area Specific Checks Pass/Fail Criteria
SQL Syntax Proper bracketing of reserved words Must use [brackets] for all reserved words (only Qwen models maintained this best practice)
Error Handling Error 50001-50004 implementation Must raise specific errors as defined in requirements
Input Validation Parameter validation as specified All null parameters must trigger error 50001
Production Readiness Error logging to dbo.DbError table Must implement TRY...CATCH with proper logging
Context Handling Full output without truncation No context window overflow issues
Code Quality Code organization, documentation Must follow .NET best practices

One-Pass Implementation Protocol

  • For each model, run the exact same prompt for both T-SQL generation and C# code generation (no prompt variations).

  • Capture all generated output in a standardized format.

  • Apply automated checks for:

    • SQL syntax validation using SQL linters
    • C# code correctness via .NET analyzers
    • Error handling implementation
    • Documentation completeness
    • Document any output truncation or context window issues

Critical Quality Metrics

Tracked metrics per model:

  • Success Rate: Percentage of requirements met
  • Error Rate: Number of errors in generated code
  • Context Limitation: Percentage of outputs truncated due to context window
  • Production Readiness Score: 0-10 scale based on production quality
  • Time to Generate: Average time for one complete generation

Analysis Focus Areas

This methodology specifically tracks:

  • How well models handle SQL reserved words (critical finding in the document)
  • Performance with complex requirements that test context limits
  • Consistency across different code components (SQL vs. C#)
  • Handling of edge cases like null values and date ranges

Platform Used

The study was conducted on a high-performance Windows system, consisting in:

This configuration represents a typical high-end local development environment for running LLMs.

LLM Models Chosen

The benchmark evaluated 14 models across different categories:

  1. Specialized Coding Models:

  2. General Purpose Models:

  3. Specialized Variants:

This methodology directly addresses one's one-pass constraint while ensuring meaningful results. By standardizing the prompt and evaluation criteria, one can fairly compare models without introducing bias from prompt variations. The focus on specific failure points (like the SQL reserved words issue highlighted in the document) ensures one captures the most critical weaknesses that affect real-world deployment.

The most valuable insight from this approach would be identifying which models consistently handle complex code generation without context truncation while maintaining proper SQL syntax and error handling - the key finding from the document that only the Cohere, Mistral-AI, OpenAI and Qwen models maintained proper bracketing for SQL reserved words. Qwen models have taken a step beyond by including proper bracketing for all the columns of the given table.

This methodology provides actionable insights for developers selecting the right local LLM for production code generation tasks without compromising quality.

Normalized prompt for T-SQL code generation (MS SQL Server 2022)

Below is the normalized prompt used for all the models (including Cohere Command):


Task:

Generate SQL stored procedures for the Activity table based on the following requirements. Use T-SQL syntax and ensure the code is production-ready.

Table Definition:

CREATE TABLE [dbo].[Activity](
    [ActivityId] [uniqueidentifier] NOT NULL,
    [ProjectId] [uniqueidentifier] NOT NULL,
    [ProjectMemberId] [uniqueidentifier] NOT NULL,
    [Name] [nvarchar](128) NOT NULL,
    [Description] [nvarchar](4000) NOT NULL,
    [StartDate] [date] NULL,
    [TargetDate] [date] NULL,
    [EndDate] [date] NULL,
    [ProgressStatus] [tinyint] NULL,
    [ActivityPoints] [smallint] NULL,
    [Priority] [tinyint] NULL,
    [Risk] [tinyint] NULL,
    [Tags] [nvarchar](200) NULL,
    [ActiveFlag] [tinyint] NOT NULL,
    [SystemDeleteFlag] [char](1) NOT NULL,
    [CreatedDateTime] [datetime2](7) NOT NULL,
    [CreatedByUser] [nvarchar](100) NOT NULL,
    [CreatedByProgram] [nvarchar](100) NOT NULL,
    [UpdatedDateTime] [datetime2](7) NULL,
    [UpdatedByUser] [nvarchar](100) NULL,
    [UpdatedByProgram] [nvarchar](100) NULL,
    [SystemTimestamp] [timestamp] NOT NULL,
    CONSTRAINT [PK_Activity_ActivityId] PRIMARY KEY CLUSTERED ([ActivityId] ASC)
) ON [PRIMARY];
GO

Stored Procedures to Create:

usp_ActivityInsert:

  • Input parameters: All columns except UpdatedDateTime, UpdatedByUser, UpdatedByProgram, and SystemTimestamp.
  • Explicitly include CreatedDateTime, CreatedByUser, and CreatedByProgram as input parameters.
  • Default nullable columns to NULL.

usp_ActivityDelete:

  • Input parameters: ActivityId, UpdatedDateTime, UpdatedByUser, UpdatedByProgram, and SystemTimestamp.
  • Soft-delete by setting SystemDeleteFlag = 'Y'.
  • Default UpdatedDateTime, UpdatedByUser, and UpdatedByProgram to SYSUTCDATETIME(), SYSTEM_USER, and APP_NAME() if null.
  • Use SystemTimestamp for optimistic locking.

usp_ActivityUpdate:

  • Input parameters: All columns except CreatedDateTime, CreatedByUser, and CreatedByProgram.
  • Default UpdatedDateTime, UpdatedByUser, and UpdatedByProgram to SYSUTCDATETIME(), SYSTEM_USER, and APP_NAME() if null.
  • Use SystemTimestamp for optimistic locking.

usp_ActivityRetrieve:

  • Input parameters: All columns (defaulted to NULL), except UpdatedDateTime, UpdatedByUser, and UpdatedByProgram.

  • Default ActiveFlag to 1 and SystemDeleteFlag to 'N'.

  • Use wildcard search for varchar and nvarchar columns:

    (@Parameter IS NULL OR CHARINDEX(@Parameter, [Column], 0) > 0)

  • Use date range search for date columns:

    (@DateTimeParameter IS NULL OR DateTimeColumn >= @DateTimeParameter AND DateTimeColumn < DATEADD(day, 1, @DateTimeParameter))

usp_ActivityRetrieveForList:

  • Retrieve ActivityId and Name for active, non-deleted records (ActiveFlag = 1 and SystemDeleteFlag <> 'Y').

General Requirements for All Stored Procedures:

  • Validate input parameters for type, size, and nullability.

  • Raise error 50001 for null parameters.

  • Raise error 50002 for string parameters exceeding column length.

  • Raise error 50003 for invalid ActiveFlag (must be 0 or 1) or SystemDeleteFlag (must be 'N' or 'Y').

  • Wrap SELECT, INSERT, and UPDATE statements in a TRY...CATCH block.

  • Log errors to dbo.DbError table in the CATCH block.

  • Raise error 50000 with message: "Error occurred during [Operation] operation."

  • For optimistic lock violations, raise error 50004 with message:

    "Operation failed because another user has updated or deleted this Activity. Your changes have been lost. Please review their changes before trying again."

  • Enclose each stored procedure in a separate SQL code block.

Output Format:

Provide each stored procedure in a separate code block, fully functional and ready to execute.


Normalized prompt for C# code generation

Below is the normalized prompt used for all the models, including Cohere Command:


Prompt:

Based on the following SQL table definition, generate the necessary C# components for a full-stack .NET application. The components should include:

  • C# Model: A class representing the Activity table, including all columns as properties with appropriate data types.
  • C# DTOs: Data Transfer Objects (DTOs) for Create, Update, and Read operations, mapping to the relevant fields in the Activity table.
  • C# View Models: View models for Create, Update, and List operations, tailored for use in Razor Pages.
  • Entity Framework Core Repository: A generic repository with CRUD methods (Get, Add, Update, Delete) and an Activity-specific repository implementation.
  • Minimal API: A Minimal API controller with endpoints for CRUD operations (GET, POST, PUT, DELETE) on the Activity table.
  • ASP.NET Razor Pages: Separate Razor Pages for Create, Edit, Details, Delete, and Index operations, including the necessary @page models and HTML forms.
  • Unit Tests (NUnit): Tests for models, DTOs, and view models to validate property assignments and data annotations.
  • Integration Tests (Moq): Tests for EF Core repository methods, Minimal API endpoints, and Razor Pages, using Moq to simulate database interactions.

Ensure all code follows .NET best practices, includes proper validation, and is ready for production use. Provide each component in separate code blocks with clear comments."

Table Definition for Reference:

CREATE TABLE [dbo].[Activity](
    [ActivityId] [uniqueidentifier] NOT NULL,
    [ProjectId] [uniqueidentifier] NOT NULL,
    [ProjectMemberId] [uniqueidentifier] NOT NULL,
    [Name] [nvarchar](128) NOT NULL,
    [Description] [nvarchar](4000) NOT NULL,
    [StartDate] [date] NULL,
    [TargetDate] [date] NULL,
    [EndDate] [date] NULL,
    [ProgressStatus] [tinyint] NULL,
    [ActivityPoints] [smallint] NULL,
    [Priority] [tinyint] NULL,
    [Risk] [tinyint] NULL,
    [Tags] [nvarchar](200) NULL,
    [ActiveFlag] [tinyint] NOT NULL,
    [SystemDeleteFlag] [char](1) NOT NULL,
    [CreatedDateTime] [datetime2](7) NOT NULL,
    [CreatedByUser] [nvarchar](100) NOT NULL,
    [CreatedByProgram] [nvarchar](100) NOT NULL,
    [UpdatedDateTime] [datetime2](7) NULL,
    [UpdatedByUser] [nvarchar](100) NULL,
    [UpdatedByProgram] [nvarchar](100) NULL,
    [SystemTimestamp] [timestamp] NOT NULL,
    CONSTRAINT [PK_Activity_ActivityId] PRIMARY KEY CLUSTERED ([ActivityId] ASC)
) ON [PRIMARY];
GO

Expected Output Structure:

  • C# Model
  • C# DTOs (CreateDto, UpdateDto, ReadDto)
  • C# View Models (CreateViewModel, UpdateViewModel, ListViewModel)
  • EF Core Repository (Generic Repository, Activity Repository)
  • Minimal API (CRUD Endpoints)
  • Razor Pages (Create.cshtml, Edit.cshtml, Details.cshtml, Delete.cshtml, Index.cshtml, and their respective PageModels)
  • Unit Tests (NUnit)
  • Integration Tests (Moq)

Provide each component in a well-organized, production-ready format.


Results

[placeholder]

T-SQL Generation

[placeholder]

C# Models

[placeholder]

C# DTOs

[placeholder]

C# ViewModels

[placeholder]

C# Repositories

[placeholder]

Asp.Net Core Minimal APIs

[placeholder]

Asp.Net Core Razor Page(s)

[placeholder]

Unit Tests results

[placeholder]

Integration Tests results

[placeholder]

Comments

When it comes to run LLMs locally, choosing the right model for a given task is probably one of the most important decisions one can make. While some models are simply brilliant at coding tasks, other can't complete the tasks even when explicitly instructed about what to do.

[placeholder]

With all that said, choosing the right model for the tasks above allowed me to refactor an old Asp.Net project of mine as follows:

  • Data layer: 20h (SQL Server 2022 db with 67 tables and 146 stored procedures, totalling 17,000 lines of code).
  • C# Models: TBD
  • C# DTOs: TBD
  • C# ViewModels: TBD
  • C# Repositories: TBD
  • Asp.Net Core Minimal Api: TBD
  • Asp.Net Code Razor pages: TBD
  • Unit tests: TBD
  • Integration tests: TBD

Useful links

Found these findings useful? Leave a ⭐.

About

Small/Medium LLM Benchmarks using LM Studio

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published