How to leverage date partition pruning? #253

Closed
MatinF opened this issue May 2, 2024 · 1 comment
MatinF commented May 2, 2024

I have a Google Cloud Storage bucket that contains a Parquet data lake structured as below:

bucket/deviceid/message/yyyy/mm/dd/xyz.parquet

I aim to use BigQuery to create external tables to enable serverless SQL queries from this bucket (with the end goal of visualizing the data in Grafana dashboards). To do so I wish to leverage partitioning.

As is evident, my data is not hive-partitioned. In my previous Amazon Athena + Grafana setup I got around this via my AWS Glue job, where I could define the partitioning logic as below:

# Define the partition keys and other table properties
partition_keys = [{"Name": "date_created", "Type": "string"}]
sample_file_key_parts = sample_file_key.split("/")
s3_table_path = "/".join(sample_file_key_parts[:2])
date_path = "${date_created}"
full_s3_path = f"s3://{bucket_output}/{s3_table_path}/{date_path}/"

projection_props = {
    "typeOfData": "file",
    "classification": "parquet",
    "partition_filtering.enabled": "true",
    "projection.enabled": "true",
    "projection.date_created.type": "date",
    "projection.date_created.format": "yyyy/MM/dd",
    "projection.date_created.range": "2019/01/01,NOW",
    "projection.date_created.interval": "1",
    "projection.date_created.interval.unit": "DAYS",
    "storage.location.template": full_s3_path,
}

Is there an equivalent way to define custom partitioning in Google BigQuery? I'm OK with creating my external tables via a Python script as I will anyway need to do so at scale.
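For reference, here is a rough sketch of what scripted table creation could look like if I restructured the bucket into hive-style dt=YYYY-MM-DD folders, since BigQuery's CREATE EXTERNAL TABLE DDL accepts a hive_partition_uri_prefix option. All names below (project, dataset, table, the dt column) are illustrative placeholders, not my actual setup, and the generated DDL is a sketch rather than a tested statement:

```python
# Sketch: build CREATE EXTERNAL TABLE DDL for a hive-partitioned Parquet
# lake in GCS. Assumes a dt=YYYY-MM-DD folder layout; every name here is
# an illustrative placeholder.
def build_external_table_ddl(project, dataset, table, bucket, prefix):
    uri_prefix = f"gs://{bucket}/{prefix}/"
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        "WITH PARTITION COLUMNS (dt DATE)\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{uri_prefix}*'],\n"
        f"  hive_partition_uri_prefix = '{uri_prefix}'\n"
        ")"
    )

ddl = build_external_table_ddl(
    "my-project", "cssdataset2", "tbl_demo",
    "bucket", "0BFD7754/CAN9_ImuData"
)
```

The resulting statement could then be run per device/message combination via the BigQuery client, which matches my need to create tables at scale from Python.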

In the ideal scenario, I would be able to perform my partition pruning similarly to Amazon Athena, where I can add a filter as below:

WHERE date_created between '2020-04-22' and '2021-01-19'

I tried achieving something similar on the fly in my BigQuery SQL, but from what I can tell this does not effectively prune partitions and will therefore not be of practical use:

SELECT
  PARSE_DATE('%Y/%m/%d', REGEXP_EXTRACT(_FILE_NAME, r'.*/(\d{4}/\d{2}/\d{2})/.*')) AS event_date,
  COUNT(AccelerationX) AS cnt_accx
FROM
  cssdataset2.tbl_0BFD7754_CAN9_ImuData
WHERE
  PARSE_DATE('%Y/%m/%d', REGEXP_EXTRACT(_FILE_NAME, r'.*/(\d{4}/\d{2}/\d{2})/.*')) BETWEEN '2023-06-22' AND '2023-06-22'
GROUP BY
  event_date
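One workaround I could script with the current yyyy/mm/dd layout would be to push the pruning to table-creation time instead of query time: expand the wanted date range into explicit per-day URI wildcards and pass only those to the external table definition. A hypothetical Python helper (bucket/device/message names are illustrative):

```python
from datetime import date, timedelta

def daily_uris(bucket, device_id, message, start, end):
    """Expand a date range into per-day GCS wildcards for a
    bucket/deviceid/message/yyyy/mm/dd/ layout (illustrative names)."""
    uris, day = [], start
    while day <= end:
        uris.append(
            f"gs://{bucket}/{device_id}/{message}/"
            f"{day:%Y/%m/%d}/*.parquet"
        )
        day += timedelta(days=1)
    return uris
```

For example, daily_uris("bucket", "0BFD7754", "CAN9_ImuData", date(2023, 6, 22), date(2023, 6, 23)) yields two wildcards, one per day. The obvious downside is that the table would have to be (re)created whenever the date range of interest changes, so this is only a sketch of a fallback, not dynamic pruning.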

I have also considered switching my data structure to hive partitions à la year=yyyy/month=mm/day=dd/. But even then I seem unable to get something similar to the Athena filter into my SQL queries, which seems vital for a Grafana plugin integration with time series data: without it, there is no simple dynamic way to prune irrelevant dates from the queries. Is the only practical way of achieving this to use e.g. a /dt=yyyy-mm-dd/ folder structure?
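If I do end up switching layouts, the object rename itself is mechanical. A sketch of the pure path mapping (no GCS calls; the sample path is illustrative):

```python
import re

def to_hive_path(path):
    """Map deviceid/message/yyyy/mm/dd/file.parquet to a
    deviceid/message/dt=yyyy-mm-dd/file.parquet layout."""
    m = re.match(r"(.+)/(\d{4})/(\d{2})/(\d{2})/([^/]+)$", path)
    if not m:
        raise ValueError(f"unexpected path layout: {path}")
    head, yyyy, mm, dd, fname = m.groups()
    return f"{head}/dt={yyyy}-{mm}-{dd}/{fname}"

to_hive_path("0BFD7754/CAN9_ImuData/2023/06/22/xyz.parquet")
# → "0BFD7754/CAN9_ImuData/dt=2023-06-22/xyz.parquet"
```

A one-off job could then copy each object to its mapped key, after which a single dt DATE partition column would support filters like WHERE dt BETWEEN '2020-04-22' AND '2021-01-19'.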

gabor commented May 21, 2024

hi @MatinF , sorry, i do not know the answer to the question. as this seems to be a how-to-do-x question, and not a bug report or a feature request, i recommend asking about it in the grafana community, at https://community.grafana.com/tag/bigquery .

if i misunderstood, and this is something that the grafana bigquery plugin could/should directly support, just reply please with more details, and we can reopen the issue.

@gabor gabor closed this as completed May 21, 2024