@srielau srielau commented Sep 13, 2025

What changes were proposed in this pull request?

This PR introduces a comprehensive parameter substitution system that significantly expands parameter marker support and implements major performance optimizations. The current implementation of parameter markers (:param and ?) is restricted to expressions and queries. This PR enables parameter markers to work universally across all SQL constructs.

Key Changes

Enhanced Parameter Substitution Architecture:

  • Added ParameterHandler class with centralized, optimized parameter substitution logic
  • Integrated parameter substitution directly into SparkSqlParser with parser-aware error context
  • Implemented ParameterSubstitutionContext with thread-local substitution information
  • Added sophisticated position mapping for accurate error reporting in original SQL text
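
To make the pre-parse substitution and position-mapping idea concrete, here is a minimal, illustrative sketch in Python. It is not Spark's implementation: the function names and the `(position, delta)` shift format are hypothetical, and a real pre-parser must tokenize the SQL so that markers inside string literals, comments, and `::` casts are left alone (this sketch uses a naive regex).

```python
import re

def to_sql_literal(value):
    """Render a value as SQL literal text (hypothetical stand-in for the
    PR's LiteralToSqlConverter)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape embedded single quotes so the substituted text stays valid SQL.
    return "'" + str(value).replace("'", "''") + "'"

def substitute(sql, params):
    """Replace :name markers with literal text, recording (original_pos,
    cumulative_delta) shifts so later errors can be mapped back to the
    original SQL. Naive: does not skip strings/comments/:: casts."""
    shifts, out = [], []
    last = 0
    delta = 0  # cumulative growth of the substituted text so far
    for m in re.finditer(r":(\w+)", sql):
        lit = to_sql_literal(params[m.group(1)])
        out.append(sql[last:m.start()])
        out.append(lit)
        delta += len(lit) - (m.end() - m.start())
        shifts.append((m.end(), delta))
        last = m.end()
    out.append(sql[last:])
    return "".join(out), shifts
```

For example, `substitute("SELECT :p WHERE c = :q", {"p": 1, "q": "a'b"})` yields `"SELECT 1 WHERE c = 'a''b'"` plus the shift records needed for error repositioning.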

EXECUTE IMMEDIATE Integration & Error Context:

  • Enhanced ResolveExecuteImmediate with comprehensive parameter validation and local variable isolation
  • Added HybridParameterContext supporting both named and positional parameters with strict validation
  • Implemented parser-aware error context that correctly maps errors back to original SQL before parameter substitution
  • Added "firewall" mechanism preventing EXECUTE IMMEDIATE from accessing outer SQL scripting variables
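
The named/positional validation can be sketched as follows. This is an illustrative sketch of the kind of check HybridParameterContext performs, not its actual code; only the error name ALL_PARAMETERS_MUST_BE_NAMED comes from the PR description, and the argument representation is an assumption.

```python
def check_using_clause(args):
    """Validate an EXECUTE IMMEDIATE ... USING argument list: arguments
    must be all named or all positional, never a mix.

    `args` is a list of (value, name_or_None) pairs, e.g. from
    `USING 1, 2 AS p` -> [(1, None), (2, "p")].
    """
    named = [name for _, name in args if name is not None]
    if named and len(named) != len(args):
        # Mixing bare and named arguments is rejected up front.
        raise ValueError("ALL_PARAMETERS_MUST_BE_NAMED")
    if len(set(named)) != len(named):
        raise ValueError("duplicate parameter name in USING clause")
    return "named" if named else "positional"
```

Under this check, `USING 1, 2 AS p` (one bare, one named argument) fails immediately, which matches the error example shown later in this description.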

Legacy Mode Support:

  • Maintained full backward compatibility with spark.sql.legacy.parameterSubstitution.constantsOnly
  • In legacy mode: parameter substitution disabled, analyzer-based binding preserved
  • In new mode: full text-based parameter substitution with enhanced validation
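
The gating between the two modes can be sketched like this. Only the config key name comes from the PR description; the function, the naive regex substitution, and the plain-dict config are illustrative assumptions.

```python
import re

def substitute_if_enabled(sql, params, conf):
    """Apply pre-parse substitution unless the legacy flag is set."""
    if conf.get("spark.sql.legacy.parameterSubstitution.constantsOnly") == "true":
        # Legacy mode: leave markers in place for analyzer-based binding.
        return sql
    # New mode: substitute :name markers textually before parsing
    # (simplified; real code renders values as proper SQL literals).
    return re.sub(r":(\w+)", lambda m: str(params[m.group(1)]), sql)
```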

Why are the changes needed?

Functional Limitations:

  1. Limited scope: Parameter markers only worked in expressions/queries, not DDL/utility statements
  2. Analyzer dependency: parameters were bound during analysis, so literal positions that never reach the analyzer as expressions (e.g. in DDL and utility statements) could not be parameterized

The enhanced architecture solves these by:

  • Universal parameter support through pre-parsing substitution
  • Parser-aware error context maintaining accurate error positions
  • High-performance algorithms optimized for large SQL texts
  • Comprehensive validation preventing common parameter usage errors

Does this PR introduce any user-facing changes?

Yes - Significantly Expanded Functionality:

New Parameter Support:

  • DDL statements: CREATE VIEW v(c1) AS (SELECT :p)
  • Utility statements: SHOW TABLES FROM schema LIKE :pattern
  • Type definitions: DECIMAL(?, ?)
  • Table properties: TBLPROPERTIES (':key' = ':value')
  • All SQL constructs where literals are valid

Full Backward Compatibility:

  • All existing parameter usage continues working unchanged
  • Legacy mode behavior completely preserved
  • No breaking changes to existing APIs

How was this patch tested?

Comprehensive Unit Testing:

  • Enhanced ParametersSuite with 50+ test cases covering all parameter scenarios
  • Added ParameterSubstitutionSuite for algorithm-specific testing
  • SqlScriptingExecutionSuite tests for EXECUTE IMMEDIATE isolation
  • Position mapping accuracy tests ensuring correct error reporting

Integration Testing:

  • Cross-validated legacy vs modern mode behavior consistency
  • EXECUTE IMMEDIATE parameter validation across all scenarios
  • Error context accuracy for nested parameter substitution
  • Thread safety validation for concurrent parameter usage

Manual Verification:

  • DDL parameter substitution: CREATE TABLE :name (id INT)
  • Complex parameter scenarios: EXECUTE IMMEDIATE 'SELECT typeof(:p), :p' USING 5::INT AS p
  • Error position accuracy in multi-line SQL with parameters
  • Performance benchmarking on large SQL scripts

Example Enhanced Capabilities:

// Universal parameter support
spark.sql("CREATE TABLE :table_name (id :type)", 
  Map("table_name" -> "users", "type" -> "BIGINT"))

// Enhanced validation with accurate error context
spark.sql("EXECUTE IMMEDIATE 'SELECT :p' USING 1, 2 AS p") 
// Fails with ALL_PARAMETERS_MUST_BE_NAMED, showing exact position

// Performance: handles large SQL with many parameters efficiently
spark.sql(largeSqlWithManyParameters, parameterMap) // Now O(k) instead of O(n²)

Quality Assurance:

  • All scalastyle checks pass
  • No compilation warnings or deprecation issues
  • Thread safety verified through concurrent testing
  • Memory usage profiling confirms optimization effectiveness

@srielau srielau changed the title [WIP][SQL] Use Pre-processor for generalized parameter marker handling [SPARK-53573][WIP][SQL] Use Pre-processor for generalized parameter marker handling Sep 13, 2025
@srielau srielau marked this pull request as draft September 13, 2025 21:51
@srielau srielau mentioned this pull request Sep 16, 2025
@srielau srielau marked this pull request as ready for review September 30, 2025 15:04
@srielau srielau changed the title [SPARK-53573][WIP][SQL] Use Pre-processor for generalized parameter marker handling [SPARK-53573][SQL] Use Pre-processor for generalized parameter marker handling Sep 30, 2025
This commit implements a comprehensive parameter substitution system for Spark SQL
that provides detailed error context for EXECUTE IMMEDIATE statements, similar to
how views handle errors.

Key features:
- Pre-parser approach with position mapping for accurate error reporting
- Thread-local parameter context for parser-aware error handling
- Support for both named and positional parameters across all SQL APIs
- Optimized position mapping algorithm (O(n²) → O(k) where k = substitutions)
- Comprehensive test coverage including edge cases and error scenarios
- Backward compatibility with legacy parameter substitution mode

The implementation includes:
- ParameterHandler for unified parameter processing
- PositionMapper for efficient error position translation
- LiteralToSqlConverter for type-safe SQL generation
- Integration with SparkSqlParser, SparkSession, and EXECUTE IMMEDIATE
- Enhanced error messages showing both outer and inner query context

This addresses the user request for detailed error context in EXECUTE IMMEDIATE
statements while maintaining performance and compatibility.
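
The O(n²) → O(k) position-mapping optimization mentioned above can be sketched as a sorted table of cumulative shifts queried with binary search, so each error-position lookup costs O(log k) rather than rescanning the text. This is an illustrative sketch, not Spark's actual PositionMapper; the shift-table format is an assumption.

```python
import bisect

class PositionMapper:
    """Map a position in the substituted SQL back to the original text."""

    def __init__(self, shifts):
        # shifts: sorted (substituted_pos, cumulative_delta) pairs, where
        # cumulative_delta = total growth of the text up to that point.
        self.positions = [p for p, _ in shifts]
        self.deltas = [d for _, d in shifts]

    def to_original(self, substituted_pos):
        # Find the last substitution ending at or before this position
        # and undo the accumulated length change.
        i = bisect.bisect_right(self.positions, substituted_pos) - 1
        delta = self.deltas[i] if i >= 0 else 0
        return substituted_pos - delta
```

For instance, with shifts `[(9, 2), (20, 5)]` (the text grew by 2 characters by position 9 and by 5 by position 20), a parse error at substituted position 25 maps back to original position 20.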
@srielau srielau force-pushed the preparser-squashed branch from efbd9cf to 4fe9577 on October 15, 2025
@srielau srielau force-pushed the preparser-squashed branch from a238338 to 20e66b0 on October 15, 2025
@dtenedor (Contributor) commented:
We reviewed this extensively; the architecture is sound and the test coverage is thorough. The one remaining CI failure is OracleIntegrationSuite, which is flaky and unrelated to this PR. Merging to master.

@dtenedor dtenedor closed this in 983d384 Oct 16, 2025