How to leverage date partition pruning? #253

Closed
MatinF opened this issue May 2, 2024 · 1 comment
MatinF commented May 2, 2024

I have a Google Cloud Storage bucket that contains a Parquet data lake structured as below:

bucket/deviceid/message/yyyy/mm/dd/xyz.parquet

I aim to use BigQuery to create external tables to enable serverless SQL queries from this bucket (with the end goal of visualizing the data in Grafana dashboards). To do so I wish to leverage partitioning.

As is evident, my data is not hive-partitioned. In my previous Amazon Athena + Grafana setup I got around this via my AWS Glue job, where I could define the partitioning logic as below:

# Define the partition keys and other table properties
partition_keys = [{"Name": "date_created", "Type": "string"}]
sample_file_key_parts = sample_file_key.split("/")
s3_table_path = "/".join(sample_file_key_parts[:2])
date_path = "${date_created}"
full_s3_path = f"s3://{bucket_output}/{s3_table_path}/{date_path}/"

projection_props = {
    "typeOfData": "file",
    "classification": "parquet",
    "partition_filtering.enabled": "true",
    "projection.enabled": "true",
    "projection.date_created.type": "date",
    "projection.date_created.format": "yyyy/MM/dd",
    "projection.date_created.range": "2019/01/01,NOW",
    "projection.date_created.interval": "1",
    "projection.date_created.interval.unit": "DAYS",
    "storage.location.template": full_s3_path,
}

Is there an equivalent way to define custom partitioning in Google BigQuery? I'm OK with creating my external tables via a Python script as I will anyway need to do so at scale.
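For reference, here is a rough sketch of what scripted table creation could look like if I restructured the bucket into hive-style dt=YYYY-MM-DD folders, since BigQuery's CREATE EXTERNAL TABLE DDL accepts a hive_partition_uri_prefix option. All names below (project, dataset, table, the dt column) are illustrative placeholders, not my actual setup, and the generated DDL is a sketch rather than a tested statement:

```python
# Sketch: build CREATE EXTERNAL TABLE DDL for a hive-partitioned Parquet
# lake in GCS. Assumes a dt=YYYY-MM-DD folder layout; every name here is
# an illustrative placeholder.
def build_external_table_ddl(project, dataset, table, bucket, prefix):
    uri_prefix = f"gs://{bucket}/{prefix}/"
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        "WITH PARTITION COLUMNS (dt DATE)\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{uri_prefix}*'],\n"
        f"  hive_partition_uri_prefix = '{uri_prefix}'\n"
        ")"
    )

ddl = build_external_table_ddl(
    "my-project", "cssdataset2", "tbl_demo",
    "bucket", "0BFD7754/CAN9_ImuData"
)
```

The resulting statement could then be run per device/message combination via the BigQuery client, which matches my need to create tables at scale from Python.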

In the ideal scenario, I would be able to perform my partition pruning similarly to Amazon Athena, where I can add a filter as below:

WHERE date_created between '2020-04-22' and '2021-01-19'

I tried achieving something similar on the fly in my BigQuery SQL, but from what I can tell this does not effectively prune partitions and will therefore not be of practical use:

SELECT
  PARSE_DATE('%Y/%m/%d', REGEXP_EXTRACT(_FILE_NAME, r'.*/(\d{4}/\d{2}/\d{2})/.*')) AS event_date,
  COUNT(AccelerationX) AS cnt_accx
FROM
  cssdataset2.tbl_0BFD7754_CAN9_ImuData
WHERE
  PARSE_DATE('%Y/%m/%d', REGEXP_EXTRACT(_FILE_NAME, r'.*/(\d{4}/\d{2}/\d{2})/.*')) BETWEEN '2023-06-22' AND '2023-06-22'
GROUP BY
  event_date
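One workaround I could script with the current yyyy/mm/dd layout would be to push the pruning to table-creation time instead of query time: expand the wanted date range into explicit per-day URI wildcards and pass only those to the external table definition. A hypothetical Python helper (bucket/device/message names are illustrative):

```python
from datetime import date, timedelta

def daily_uris(bucket, device_id, message, start, end):
    """Expand a date range into per-day GCS wildcards for a
    bucket/deviceid/message/yyyy/mm/dd/ layout (illustrative names)."""
    uris, day = [], start
    while day <= end:
        uris.append(
            f"gs://{bucket}/{device_id}/{message}/"
            f"{day:%Y/%m/%d}/*.parquet"
        )
        day += timedelta(days=1)
    return uris
```

For example, daily_uris("bucket", "0BFD7754", "CAN9_ImuData", date(2023, 6, 22), date(2023, 6, 23)) yields two wildcards, one per day. The obvious downside is that the table would have to be (re)created whenever the date range of interest changes, so this is only a sketch of a fallback, not dynamic pruning.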

I have also considered switching my data structure to hive partitions à la year=yyyy/month=mm/day=dd/. But even then I seem unable to get something similar to the Athena filter into my SQL queries, which seems vital for a Grafana plugin integration with time series data: without it, there is no simple dynamic way to prune irrelevant dates from the queries. Is the only practical way of achieving this to use e.g. a /dt=yyyy-mm-dd/ folder structure?
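If I do end up switching layouts, the object rename itself is mechanical. A sketch of the pure path mapping (no GCS calls; the sample path is illustrative):

```python
import re

def to_hive_path(path):
    """Map deviceid/message/yyyy/mm/dd/file.parquet to a
    deviceid/message/dt=yyyy-mm-dd/file.parquet layout."""
    m = re.match(r"(.+)/(\d{4})/(\d{2})/(\d{2})/([^/]+)$", path)
    if not m:
        raise ValueError(f"unexpected path layout: {path}")
    head, yyyy, mm, dd, fname = m.groups()
    return f"{head}/dt={yyyy}-{mm}-{dd}/{fname}"

to_hive_path("0BFD7754/CAN9_ImuData/2023/06/22/xyz.parquet")
# → "0BFD7754/CAN9_ImuData/dt=2023-06-22/xyz.parquet"
```

A one-off job could then copy each object to its mapped key, after which a single dt DATE partition column would support filters like WHERE dt BETWEEN '2020-04-22' AND '2021-01-19'.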

gabor commented May 21, 2024

hi @MatinF , sorry, i do not know the answer to the question. as this seems to be a how-to-do-x question, and not a bug report or a feature request, i recommend asking about it in the grafana community, at https://community.grafana.com/tag/bigquery .

if i misunderstood, and this is something that the grafana bigquery plugin could/should directly support, just reply please with more details, and we can reopen the issue.

@gabor gabor closed this as completed May 21, 2024