dbt run hangs for minutes when an Iceberg catalog is attached #640

@bitadmiral

Description

Regardless of whether a secondary catalog is attached, dbt run and dbt compile execute the following SQL query to list the tables in the local DuckDB database schema. When an Iceberg catalog is attached, however, this query can take several minutes.

select
      '<duckdb_database>' as database,
      table_name as name,
      table_schema as schema,
      CASE table_type
        WHEN 'BASE TABLE' THEN 'table'
        WHEN 'VIEW' THEN 'view'
        WHEN 'LOCAL TEMPORARY' THEN 'table'
        END as type
    from system.information_schema.tables
    where lower(table_schema) = '<duckdb_schema>'
    and lower(table_catalog) = '<duckdb_database>'
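
For anyone trying to reproduce this, the same metadata lookup can be run directly in a DuckDB session and timed with and without the Iceberg catalog attached; that isolates the slowdown to the information_schema scan. This is a sketch: `my_db`/`main` are placeholder names, not values from the log above.

```sql
-- Run once with only the local database, then again after ATTACHing the
-- Iceberg/Glue catalog, and compare the reported timings.
EXPLAIN ANALYZE
select
      'my_db' as database,        -- placeholder local database name
      table_name as name,
      table_schema as schema,
      CASE table_type
        WHEN 'BASE TABLE' THEN 'table'
        WHEN 'VIEW' THEN 'view'
        WHEN 'LOCAL TEMPORARY' THEN 'table'
        END as type
    from system.information_schema.tables
    where lower(table_schema) = 'main'
    and lower(table_catalog) = 'my_db';
```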

Below is an excerpt from the dbt log showing this query taking over 90 seconds to complete. We have a modestly populated Glue catalog in the AWS account where we develop pipelines with dbt, and every dbt-duckdb user whose project attaches to that Glue catalog reports this problem consistently. It makes developing a dbt project very painful, because every change to a model is usually followed by a quick test run to check the output, and that run now stalls on this metadata query. Users are opting to forgo the Iceberg catalog integration entirely because they can't develop in rapid iterations.

18:50:41  Opening a new connection, currently in state closed
18:50:41  SQL status: OK in 0.001 seconds
18:50:41  Using duckdb connection "list_vocab_ingest_v1_4_0_main"
18:50:41  On list_vocab_ingest_v1_4_0_main: /* {"app": "dbt", "dbt_version": "1.10.11", "profile_name": "dbt_vocab_ingest", "target_name": "dev", "connection_name": "list_vocab_ingest_v1_4_0_main"} */
select
      'vocab_ingest_v1_4_0' as database,
      table_name as name,
      table_schema as schema,
      CASE table_type
        WHEN 'BASE TABLE' THEN 'table'
        WHEN 'VIEW' THEN 'view'
        WHEN 'LOCAL TEMPORARY' THEN 'table'
        END as type
    from system.information_schema.tables
    where lower(table_schema) = 'main'
    and lower(table_catalog) = 'vocab_ingest_v1_4_0'
  
18:52:13  SQL status: OK in 92.597 seconds
18:52:13  On list_vocab_ingest_v1_4_0_main: ROLLBACK
18:52:13  Failed to rollback 'list_vocab_ingest_v1_4_0_main'
18:52:13  On list_vocab_ingest_v1_4_0_main: Close
18:52:13  Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'dc2ca9af-81db-49ba-9597-80da292a9ab6', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x770194c9ffe0>]}
18:52:13  Using duckdb connection "master"
18:52:13  On master: BEGIN
18:52:13  Opening a new connection, currently in state init
18:52:13  SQL status: OK in 0.001 seconds

Just to give an idea of what the query above would return if it were unfiltered, here are counts of tables and schemas by catalog and table type:

SELECT
    table_catalog,
    table_type,
    COUNT(DISTINCT table_schema) AS "schema_count",
    COUNT(DISTINCT table_name) AS "table_count"
FROM system.information_schema.tables
GROUP BY 1, 2
ORDER BY 1, 2;
| table_catalog | table_type | schema_count | table_count |
| --- | --- | --- | --- |
| glue_catalog | BASE TABLE | 113 | 14198 |
| vocab_ingest_v1_4_0 | BASE TABLE | 2 | 4 |

In the test above, the local DuckDB database is named vocab_ingest_v1_4_0, so the filtered query returns only 4 tables, yet it still takes a minute and a half.
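
One possible mitigation, sketched purely as an assumption (we have not verified it against the Glue integration): list tables via DuckDB's native duckdb_tables() function, which exposes database_name and schema_name columns, instead of scanning system.information_schema.tables. Whether DuckDB applies the database_name filter early enough to skip enumerating the attached catalog is exactly the open question.

```sql
-- Sketch only: assumes duckdb_tables() filtered by database_name avoids
-- touching tables in other attached catalogs. Names match the test above.
select
    database_name as database,
    table_name as name,
    schema_name as schema,
    'table' as type
from duckdb_tables()
where lower(schema_name) = 'main'
  and lower(database_name) = 'vocab_ingest_v1_4_0';
```

If this turns out to be fast even with the Glue catalog attached, it could suggest a change to the query dbt-duckdb generates for its table listing.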

The dbt configurations used to attach the Iceberg catalog can be seen in Issue #639.
