Pipeline to generate SQL queries #58

lchoquel · 2025-07-14T14:00:59Z

lchoquel
Jul 14, 2025
Maintainer

Inputs

List of tables and their columns
Natural language query

Work

Extract typical SQL features such as filters, aggregations and columns
Apply the features using a Jinja2 template

Output

SQL text query

Mayank-cyber-cell · 2025-07-22T15:54:40Z

Mayank-cyber-cell
Jul 22, 2025

✅ Inputs:
List of Tables & Columns (Schema)

Example:

json

{
"orders": ["id", "user_id", "amount", "created_at"],
"users": ["id", "name", "email"]
}
Natural Language Query

Example:
"Get the total amount of orders placed by each user in the last 30 days"

🔍 Work:

Parse Natural Language → Extract SQL Features
Tables involved: orders, users

Columns required: user_id, amount, created_at, name

Filters: created_at >= NOW() - INTERVAL 30 DAY

Aggregations: SUM(amount)

Group By: user_id

Joins (if required): Join orders.user_id with users.id

Template-Based SQL Generation (Jinja2)
Use Jinja2 templates to assemble the SQL using extracted features:

jinja2

SELECT
users.name,
SUM(orders.amount) AS total_amount
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.created_at >= NOW() - INTERVAL 30 DAY
GROUP BY users.name
ORDER BY total_amount DESC
LIMIT 10;
🧾 Output:
Generated SQL Query (String)
Example:

SELECT users.name, SUM(orders.amount) AS total_amount
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.created_at >= NOW() - INTERVAL 30 DAY
GROUP BY users.name
ORDER BY total_amount DESC
LIMIT 10;

0 replies

lchoquel · 2025-07-22T17:24:16Z

lchoquel
Jul 22, 2025
Maintainer Author

Hey @Mayank-cyber-cell, thanks for fleshing out the example! This gives us a good example to move forward: simple but significant.

Quick note on the Jinja2 part: what you've shown is the final SQL output, but we need the actual template with placeholders. Something like:

SELECT 
    {{ select_columns | join(', ') }}
FROM {{ main_table }}
{% for join in joins %}
JOIN {{ join.table }} ON {{ join.condition }}
{% endfor %}
{% if filters %}
WHERE {{ filters | join(' AND ') }}
{% endif %}
{% if group_by %}
GROUP BY {{ group_by | join(', ') }}
{% endif %}

Now there are tricky parts:

Templating approach: Let's start constrained. Instead of trying to handle all SQL, we create specific templates for common patterns:

aggregate_by_dimension.sql.jinja2
time_series_analysis.sql.jinja2
simple_lookup.sql.jinja2

Each template handles its use case well, and the pipeline picks the right one based on the extracted features.

Filter generation (especially that created_at >= NOW() - INTERVAL 30 DAY): This is where things get interesting. The easy path is letting an LLM generate the SQL directly, but that's basically handing matches to a pyromaniac. Imagine a user asking: "show me orders from last month'; DROP TABLE users; --". If your LLM happily generates that SQL, you're gonna have a bad time.

Even with "safe" queries, LLMs can hallucinate table names or create queries that look right but subtly corrupt your data. SQL injection is old news, but prompt injection is the new kid on the block doing the same damage with a fresh coat of paint.

So here's where we can show the power of deterministic pipelines. Like SaaS filtering GUIs, we could:

Extract the intent as structured data:

{
  "time_filter": {
    "column": "created_at",
    "operator": ">=",
    "reference": "now",
    "offset": {
      "value": 30,
      "unit": "days", 
      "direction": "past"
    }
  }
}

Or we could map to pre-defined safe patterns:

   TIME_PATTERNS = {
     "mysql": "{{ column }} >= NOW() - INTERVAL {{ value }} {{ unit }}",
     "postgres": "{{ column }} >= CURRENT_DATE - INTERVAL '{{ value }} {{ unit }}'",
     # ...
   }

This gives us dropdown-style safety with natural language flexibility. No arbitrary code generation, no prompt injection risks, just deterministic transformation. I prefer solution (1) because we can have a common model for the building blocks and just use templates to adapt to each SQL dialect. And I like the idea of having enumerated values for the possibilities (operator, reference…).

Want to collaborate on building this out as a proper pipeline? We could start with the time-series aggregation case and expand from there.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pipeline to generate SQL queries #58

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Pipeline to generate SQL queries #58

Uh oh!

lchoquel Jul 14, 2025 Maintainer

Replies: 2 comments

Uh oh!

Mayank-cyber-cell Jul 22, 2025

Uh oh!

lchoquel Jul 22, 2025 Maintainer Author

lchoquel
Jul 14, 2025
Maintainer

Mayank-cyber-cell
Jul 22, 2025

lchoquel
Jul 22, 2025
Maintainer Author