@srielau srielau commented Sep 13, 2025

What changes were proposed in this pull request?

This PR introduces a comprehensive parameter substitution system that significantly expands parameter marker support and implements major performance optimizations. The current implementation of parameter markers (:param and ?) is restricted to expressions and queries. This PR enables parameter markers to work universally across all SQL constructs.

Key Changes

Enhanced Parameter Substitution Architecture:

  • Added ParameterHandler class with centralized, optimized parameter substitution logic
  • Integrated parameter substitution directly into SparkSqlParser with parser-aware error context
  • Implemented ParameterSubstitutionContext with thread-local substitution information
  • Added sophisticated position mapping for accurate error reporting in original SQL text
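
To make the pre-parse substitution and position-mapping idea concrete, here is a minimal, illustrative sketch in Python. It is not Spark's implementation: the function names and the `(position, delta)` shift format are hypothetical, and a real pre-parser must tokenize the SQL so that markers inside string literals, comments, and `::` casts are left alone (this sketch uses a naive regex).

```python
import re

def to_sql_literal(value):
    """Render a value as SQL literal text (hypothetical stand-in for the
    PR's LiteralToSqlConverter)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape embedded single quotes so the substituted text stays valid SQL.
    return "'" + str(value).replace("'", "''") + "'"

def substitute(sql, params):
    """Replace :name markers with literal text, recording (original_pos,
    cumulative_delta) shifts so later errors can be mapped back to the
    original SQL. Naive: does not skip strings/comments/:: casts."""
    shifts, out = [], []
    last = 0
    delta = 0  # cumulative growth of the substituted text so far
    for m in re.finditer(r":(\w+)", sql):
        lit = to_sql_literal(params[m.group(1)])
        out.append(sql[last:m.start()])
        out.append(lit)
        delta += len(lit) - (m.end() - m.start())
        shifts.append((m.end(), delta))
        last = m.end()
    out.append(sql[last:])
    return "".join(out), shifts
```

For example, `substitute("SELECT :p WHERE c = :q", {"p": 1, "q": "a'b"})` yields `"SELECT 1 WHERE c = 'a''b'"` plus the shift records needed for error repositioning.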

EXECUTE IMMEDIATE Integration & Error Context:

  • Enhanced ResolveExecuteImmediate with comprehensive parameter validation and local variable isolation
  • Added HybridParameterContext supporting both named and positional parameters with strict validation
  • Implemented parser-aware error context that correctly maps errors back to original SQL before parameter substitution
  • Added "firewall" mechanism preventing EXECUTE IMMEDIATE from accessing outer SQL scripting variables
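
The named/positional validation can be sketched as follows. This is an illustrative sketch of the kind of check HybridParameterContext performs, not its actual code; only the error name ALL_PARAMETERS_MUST_BE_NAMED comes from the PR description, and the argument representation is an assumption.

```python
def check_using_clause(args):
    """Validate an EXECUTE IMMEDIATE ... USING argument list: arguments
    must be all named or all positional, never a mix.

    `args` is a list of (value, name_or_None) pairs, e.g. from
    `USING 1, 2 AS p` -> [(1, None), (2, "p")].
    """
    named = [name for _, name in args if name is not None]
    if named and len(named) != len(args):
        # Mixing bare and named arguments is rejected up front.
        raise ValueError("ALL_PARAMETERS_MUST_BE_NAMED")
    if len(set(named)) != len(named):
        raise ValueError("duplicate parameter name in USING clause")
    return "named" if named else "positional"
```

Under this check, `USING 1, 2 AS p` (one bare, one named argument) fails immediately, which matches the error example shown later in this description.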

Legacy Mode Support:

  • Maintained full backward compatibility with spark.sql.legacy.parameterSubstitution.constantsOnly
  • In legacy mode: parameter substitution disabled, analyzer-based binding preserved
  • In new mode: full text-based parameter substitution with enhanced validation
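
The gating between the two modes can be sketched like this. Only the config key name comes from the PR description; the function, the naive regex substitution, and the plain-dict config are illustrative assumptions.

```python
import re

def substitute_if_enabled(sql, params, conf):
    """Apply pre-parse substitution unless the legacy flag is set."""
    if conf.get("spark.sql.legacy.parameterSubstitution.constantsOnly") == "true":
        # Legacy mode: leave markers in place for analyzer-based binding.
        return sql
    # New mode: substitute :name markers textually before parsing
    # (simplified; real code renders values as proper SQL literals).
    return re.sub(r":(\w+)", lambda m: str(params[m.group(1)]), sql)
```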

Why are the changes needed?

Functional Limitations:

  1. Limited scope: Parameter markers only worked in expressions/queries, not DDL/utility statements
  2. Analyzer dependency: parameters were bound during analysis, so literal positions that never reach the analyzer as expressions (e.g. in DDL and utility statements) could not be parameterized

The enhanced architecture solves these by:

  • Universal parameter support through pre-parsing substitution
  • Parser-aware error context maintaining accurate error positions
  • High-performance algorithms optimized for large SQL texts
  • Comprehensive validation preventing common parameter usage errors

Does this PR introduce any user-facing changes?

Yes - Significantly Expanded Functionality:

New Parameter Support:

  • DDL statements: CREATE VIEW v(c1) AS (SELECT :p)
  • Utility statements: SHOW TABLES FROM schema LIKE :pattern
  • Type definitions: DECIMAL(?, ?)
  • Table properties: TBLPROPERTIES (':key' = ':value')
  • All SQL constructs where literals are valid

Full Backward Compatibility:

  • All existing parameter usage continues working unchanged
  • Legacy mode behavior completely preserved
  • No breaking changes to existing APIs

How was this patch tested?

Comprehensive Unit Testing:

  • Enhanced ParametersSuite with 50+ test cases covering all parameter scenarios
  • Added ParameterSubstitutionSuite for algorithm-specific testing
  • SqlScriptingExecutionSuite tests for EXECUTE IMMEDIATE isolation
  • Position mapping accuracy tests ensuring correct error reporting

Integration Testing:

  • Cross-validated legacy vs modern mode behavior consistency
  • EXECUTE IMMEDIATE parameter validation across all scenarios
  • Error context accuracy for nested parameter substitution
  • Thread safety validation for concurrent parameter usage

Manual Verification:

  • DDL parameter substitution: CREATE TABLE :name (id INT)
  • Complex parameter scenarios: EXECUTE IMMEDIATE 'SELECT typeof(:p), :p' USING 5::INT AS p
  • Error position accuracy in multi-line SQL with parameters
  • Performance benchmarking on large SQL scripts

Example Enhanced Capabilities:

// Universal parameter support
spark.sql("CREATE TABLE :table_name (id :type)", 
  Map("table_name" -> "users", "type" -> "BIGINT"))

// Enhanced validation with accurate error context
spark.sql("EXECUTE IMMEDIATE 'SELECT :p' USING 1, 2 AS p") 
// Fails with ALL_PARAMETERS_MUST_BE_NAMED, showing exact position

// Performance: handles large SQL with many parameters efficiently
spark.sql(largeSqlWithManyParameters, parameterMap) // Now O(k) instead of O(n²)

Quality Assurance:

  • All scalastyle checks pass
  • No compilation warnings or deprecation issues
  • Thread safety verified through concurrent testing
  • Memory usage profiling confirms optimization effectiveness

@srielau srielau changed the title [WIP][SQL] Use Pre-processor for generalized parameter marker handling [SPARK-53573][WIP][SQL] Use Pre-processor for generalized parameter marker handling Sep 13, 2025
@srielau srielau marked this pull request as draft September 13, 2025 21:51
@srielau srielau mentioned this pull request Sep 16, 2025
@srielau srielau marked this pull request as ready for review September 30, 2025 15:04
@srielau srielau changed the title [SPARK-53573][WIP][SQL] Use Pre-processor for generalized parameter marker handling [SPARK-53573][SQL] Use Pre-processor for generalized parameter marker handling Sep 30, 2025
This commit implements a comprehensive parameter substitution system for Spark SQL
that provides detailed error context for EXECUTE IMMEDIATE statements, similar to
how views handle errors.

Key features:
- Pre-parser approach with position mapping for accurate error reporting
- Thread-local parameter context for parser-aware error handling
- Support for both named and positional parameters across all SQL APIs
- Optimized position mapping algorithm (O(n²) → O(k) where k = substitutions)
- Comprehensive test coverage including edge cases and error scenarios
- Backward compatibility with legacy parameter substitution mode

The implementation includes:
- ParameterHandler for unified parameter processing
- PositionMapper for efficient error position translation
- LiteralToSqlConverter for type-safe SQL generation
- Integration with SparkSqlParser, SparkSession, and EXECUTE IMMEDIATE
- Enhanced error messages showing both outer and inner query context

This addresses the user request for detailed error context in EXECUTE IMMEDIATE
statements while maintaining performance and compatibility.
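
The O(n²) → O(k) position-mapping optimization mentioned above can be sketched as a sorted table of cumulative shifts queried with binary search, so each error-position lookup costs O(log k) rather than rescanning the text. This is an illustrative sketch, not Spark's actual PositionMapper; the shift-table format is an assumption.

```python
import bisect

class PositionMapper:
    """Map a position in the substituted SQL back to the original text."""

    def __init__(self, shifts):
        # shifts: sorted (substituted_pos, cumulative_delta) pairs, where
        # cumulative_delta = total growth of the text up to that point.
        self.positions = [p for p, _ in shifts]
        self.deltas = [d for _, d in shifts]

    def to_original(self, substituted_pos):
        # Find the last substitution ending at or before this position
        # and undo the accumulated length change.
        i = bisect.bisect_right(self.positions, substituted_pos) - 1
        delta = self.deltas[i] if i >= 0 else 0
        return substituted_pos - delta
```

For instance, with shifts `[(9, 2), (20, 5)]` (the text grew by 2 characters by position 9 and by 5 by position 20), a parse error at substituted position 25 maps back to original position 20.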
@srielau srielau force-pushed the preparser-squashed branch from efbd9cf to 4fe9577 on October 15, 2025
@srielau srielau force-pushed the preparser-squashed branch from a238338 to 20e66b0 on October 15, 2025
@dtenedor (Contributor) commented:
We reviewed this extensively; the architecture is sound and the test coverage is thorough. The one remaining CI failure is OracleIntegrationSuite, which is flaky and unrelated to this PR. Merging to master.

@dtenedor dtenedor closed this in 983d384 Oct 16, 2025