Add Support for ByteSize and TimeDuration Token Parsers with Aggregation Directive #984

Open: Chayandas07 wants to merge 5 commits into base: develop

Summary

This pull request introduces new token parsers and a directive to support parsing and aggregation of byte size and time duration values in Wrangler recipes.


Key Features

Lexer & Parser Enhancements

  • Added new lexer rules:
    • BYTE_SIZE with support for units: B, KB, MB, GB, TB, PB
    • TIME_DURATION with support for units: ns, us, ms, s, m, h, d
  • Updated Directives.g4 grammar to include:
    • ByteUnit, TimeUnit fragments
    • New parser rules: byteSizeArg, timeDurationArg
  • Regenerated ANTLR lexer and parser with mvn compile
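
For readers who do not want to open Directives.g4, the token shapes described above can be approximated with plain Java regular expressions. This is only a sketch: the class name `TokenShapes` is hypothetical, and the assumptions that the number may carry a decimal part, that the unit follows the number with no space, and that units are case-sensitive are mine, not taken from the grammar in this PR.

```java
import java.util.regex.Pattern;

final class TokenShapes {
    // Assumed shape: digits, optional fractional part, then one listed unit.
    static final Pattern BYTE_SIZE =
        Pattern.compile("\\d+(\\.\\d+)?(B|KB|MB|GB|TB|PB)");

    // Same assumed shape, using the time units listed in the description.
    static final Pattern TIME_DURATION =
        Pattern.compile("\\d+(\\.\\d+)?(ns|us|ms|s|m|h|d)");

    // True only if the whole string is a byte-size literal.
    static boolean isByteSize(String text) {
        return BYTE_SIZE.matcher(text).matches();
    }

    // True only if the whole string is a time-duration literal.
    static boolean isTimeDuration(String text) {
        return TIME_DURATION.matcher(text).matches();
    }
}
```

Under these assumptions, `TokenShapes.isByteSize("1.5MB")` and `TokenShapes.isTimeDuration("100ms")` both hold, while a bare number such as `"10"` matches neither pattern.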

API Updates

  • Added new classes:
    • ByteSize.java: Parses and normalizes byte sizes
    • TimeDuration.java: Parses and converts durations to milliseconds
  • Extended TokenType with BYTE_SIZE and TIME_DURATION
  • Updated relevant API components to accept these token types
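
A minimal sketch of what the parsing and normalization could look like, for illustration only. It is not the code in ByteSize.java or TimeDuration.java: the class `UnitParsers` and its methods are hypothetical, and the choice of binary multiples (1 KB = 1024 B) is an assumption, since the description does not say whether units are binary or decimal.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnitParsers {
    // A number (optionally fractional) followed directly by a unit name.
    private static final Pattern VALUE_UNIT =
        Pattern.compile("(\\d+(?:\\.\\d+)?)([A-Za-z]+)");

    // Assumption: binary multiples, so 1 KB = 1024 B.
    private static final Map<String, Double> BYTES_PER_UNIT = Map.of(
        "B", 1.0,
        "KB", 1024.0,
        "MB", Math.pow(1024, 2),
        "GB", Math.pow(1024, 3),
        "TB", Math.pow(1024, 4),
        "PB", Math.pow(1024, 5));

    // Milliseconds per unit, for the units listed in the description.
    private static final Map<String, Double> MILLIS_PER_UNIT = Map.of(
        "ns", 1e-6,
        "us", 1e-3,
        "ms", 1.0,
        "s", 1_000.0,
        "m", 60_000.0,
        "h", 3_600_000.0,
        "d", 86_400_000.0);

    // Normalizes a byte-size literal to bytes.
    static double toBytes(String text) {
        return convert(text, BYTES_PER_UNIT);
    }

    // Normalizes a duration literal to milliseconds.
    static double toMillis(String text) {
        return convert(text, MILLIS_PER_UNIT);
    }

    private static double convert(String text, Map<String, Double> factors) {
        Matcher m = VALUE_UNIT.matcher(text.trim());
        if (!m.matches() || !factors.containsKey(m.group(2))) {
            throw new IllegalArgumentException("Unrecognized value: " + text);
        }
        return Double.parseDouble(m.group(1)) * factors.get(m.group(2));
    }
}
```

With this sketch, `UnitParsers.toBytes("2KB")` gives 2048.0 and `UnitParsers.toMillis("2.5h")` gives 9000000.0. A string such as `"3min"` is rejected here because `min` is not among the listed units; the real classes may behave differently.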

Core Parser Integration

  • Implemented visitByteSizeArg and visitTimeDurationArg in the recipe visitor
  • Tokens are added to the directive execution pipeline via the builder

New Directive: aggregate-stats

  • Accepts four arguments:
    1. Source column (byte size)
    2. Source column (time duration)
    3. Target column (aggregated size)
    4. Target column (aggregated time)
  • Supports aggregation logic:
    • Summing byte sizes and time durations across rows
    • Optional unit conversion (e.g., bytes → MB, ms → seconds)
  • Uses ExecutorContext store to maintain intermediate state for aggregation
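
The aggregation described above amounts to keeping two running totals and converting them once at the end. The sketch below shows that idea with a plain Java object standing in for the ExecutorContext store; the class `AggregateStatsSketch` and its methods are hypothetical, inputs are assumed to be already normalized to bytes and milliseconds, and the MB and seconds outputs (with 1 MB = 1024 * 1024 B) are assumptions based on the example column names.

```java
final class AggregateStatsSketch {
    // Running totals kept across rows (stand-in for the shared store).
    private double totalBytes;
    private double totalMillis;

    // Called once per row with values already normalized to bytes and ms.
    void addRow(double bytes, double millis) {
        totalBytes += bytes;
        totalMillis += millis;
    }

    // Assumption: binary megabytes (1 MB = 1024 * 1024 B).
    double totalMegabytes() {
        return totalBytes / (1024.0 * 1024.0);
    }

    double totalSeconds() {
        return totalMillis / 1000.0;
    }
}
```

For example, adding a row of 1048576 bytes and 1500 ms, then a row of 524288 bytes and 500 ms, yields 1.5 MB and 2.0 seconds in total.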

Testing

  • Unit tests for:
    • ByteSize parsing (e.g., "2KB", "1.5MB")
    • TimeDuration parsing (e.g., "100ms", "2.5h", "3min")
  • Grammar parser tests ensure correct and invalid usage of the new syntax
  • Directive tests verify:
    • Aggregation logic correctness
    • Accurate conversions and results with floating-point tolerance
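
On the floating-point tolerance point: sums and unit conversions on doubles rarely land on exact decimal values, so comparisons need an allowed error. A small sketch of such a check follows; the helper `Tolerance.approxEquals` is hypothetical and is not taken from the tests in this PR.

```java
final class Tolerance {
    // True when the two values differ by at most epsilon.
    static boolean approxEquals(double expected, double actual, double epsilon) {
        return Math.abs(expected - actual) <= epsilon;
    }
}
```

For instance, `0.1 + 0.2 == 0.3` is false in Java, while `Tolerance.approxEquals(0.3, 0.1 + 0.2, 1e-9)` is true. JUnit 4's `assertEquals(expected, actual, delta)` overload for doubles performs the same kind of comparison.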

Usage (Added to README)

aggregate-stats :transfer_size :duration total_size_mb total_time_sec

Chayandas07 and others added 5 commits April 16, 2025 20:23
- Introduced BYTE_SIZE and TIME_DURATION tokens in the ANTLR grammar.
- Implemented ByteSize and TimeDuration classes for parsing and conversion.
- Updated RecipeVisitor to handle new token types in directives.
- Created AggregateStats directive for aggregating byte sizes and time durations.
- Added unit tests for ByteSize and TimeDuration parsing.
- Implemented integration tests for AggregateStats directive.
- Updated pom.xml files to include necessary dependencies and plugins.

google-cla bot commented Apr 16, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.
