Add Support for ByteSize and TimeDuration Token Parsers with Aggregation Directive #984

Chayandas07 · 2025-04-16T15:46:12Z

Summary

This pull request introduces new token parsers and a directive to support parsing and aggregation of byte size and time duration values in Wrangler recipes.

Key Features

✅ Lexer & Parser Enhancements

Added new lexer rules:
- BYTE_SIZE with support for units: B, KB, MB, GB, TB, PB
- TIME_DURATION with support for units: ns, us, ms, s, m, h, d
Updated Directives.g4 grammar to include:
- ByteUnit, TimeUnit fragments
- New parser rules: byteSizeArg, timeDurationArg
Regenerated ANTLR lexer and parser with mvn compile

✅ API Updates

Added new classes:
- ByteSize.java: Parses and normalizes byte sizes
- TimeDuration.java: Parses and converts durations to milliseconds
Extended TokenType with BYTE_SIZE and TIME_DURATION
Updated relevant API components to accept these token types
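
As an illustration of the parsing described above, here is a minimal sketch, not the code from this PR. The class name `UnitParsingSketch` and its method names are hypothetical. It also assumes binary multiples (1 KB = 1024 B) and case-insensitive byte units, neither of which the description specifies. Units outside the two lists given under Lexer & Parser Enhancements are rejected.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the ByteSize/TimeDuration classes from the PR.
final class UnitParsingSketch {
  // Assumption: binary multiples (1 KB = 1024 B); the PR does not state the base.
  private static final Map<String, Double> BYTES_PER_UNIT = Map.of(
      "B", 1.0,
      "KB", 1024.0,
      "MB", 1024.0 * 1024,
      "GB", 1024.0 * 1024 * 1024,
      "TB", 1024.0 * 1024 * 1024 * 1024,
      "PB", 1024.0 * 1024 * 1024 * 1024 * 1024);

  // Milliseconds per unit, matching the PR's statement that durations
  // are converted to milliseconds.
  private static final Map<String, Double> MILLIS_PER_UNIT = Map.of(
      "ns", 1e-6,
      "us", 1e-3,
      "ms", 1.0,
      "s", 1_000.0,
      "m", 60_000.0,
      "h", 3_600_000.0,
      "d", 86_400_000.0);

  // A non-negative decimal number followed by an alphabetic unit, e.g. "1.5MB".
  private static final Pattern VALUE_WITH_UNIT =
      Pattern.compile("\\s*(\\d+(?:\\.\\d+)?)\\s*([A-Za-z]+)\\s*");

  private UnitParsingSketch() { }

  /** Parses text such as "2KB" and returns the size in bytes. */
  static double toBytes(String text) {
    Matcher m = match(text);
    // Assumption: byte units are matched case-insensitively.
    return scale(m.group(1), m.group(2).toUpperCase(Locale.ROOT), BYTES_PER_UNIT);
  }

  /** Parses text such as "2.5h" and returns the duration in milliseconds. */
  static double toMillis(String text) {
    Matcher m = match(text);
    // Time units are matched exactly as written, so "ms" and "m" stay distinct.
    return scale(m.group(1), m.group(2), MILLIS_PER_UNIT);
  }

  private static Matcher match(String text) {
    Matcher m = VALUE_WITH_UNIT.matcher(text);
    if (!m.matches()) {
      throw new IllegalArgumentException("Malformed value: " + text);
    }
    return m;
  }

  private static double scale(String number, String unit, Map<String, Double> factors) {
    Double factor = factors.get(unit);
    if (factor == null) {
      throw new IllegalArgumentException("Unknown unit: " + unit);
    }
    return Double.parseDouble(number) * factor;
  }
}
```

Under these assumptions, `UnitParsingSketch.toBytes("2KB")` yields 2048 bytes and `UnitParsingSketch.toMillis("2.5h")` yields 9,000,000 milliseconds.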

✅ Core Parser Integration

Implemented visitByteSizeArg and visitTimeDurationArg in the recipe visitor
Tokens are added to the directive execution pipeline via the builder

✅ New Directive: aggregate-stats

Accepts four arguments:
1. Source column (byte size)
2. Source column (time duration)
3. Target column (aggregated size)
4. Target column (aggregated time)
Supports aggregation logic:
- Summing byte sizes and time durations across rows
- Optional unit conversion (e.g., bytes → MB, ms → seconds)
Uses ExecutorContext store to maintain intermediate state for aggregation
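
The aggregation described above can be sketched roughly as follows. This is an illustrative outline, not the directive's actual implementation. `AggregateStatsSketch`, its method names, and the store keys are hypothetical. A plain `HashMap` stands in for the ExecutorContext store, and 1 MB is assumed to be 1024 × 1024 bytes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the aggregation step; not the AggregateStats directive itself.
final class AggregateStatsSketch {
  // Hypothetical keys for the running totals.
  private static final String TOTAL_BYTES = "aggregate.total.bytes";
  private static final String TOTAL_MILLIS = "aggregate.total.millis";

  // Assumption: a plain map plays the role of the ExecutorContext store.
  private final Map<String, Double> store = new HashMap<>();

  /** Adds one row's already-parsed byte size and duration to the running totals. */
  void accept(double bytes, double millis) {
    store.merge(TOTAL_BYTES, bytes, Double::sum);
    store.merge(TOTAL_MILLIS, millis, Double::sum);
  }

  /** Total size in megabytes, assuming 1 MB = 1024 * 1024 bytes. */
  double totalMegabytes() {
    return store.getOrDefault(TOTAL_BYTES, 0.0) / (1024.0 * 1024.0);
  }

  /** Total duration in seconds. */
  double totalSeconds() {
    return store.getOrDefault(TOTAL_MILLIS, 0.0) / 1000.0;
  }
}
```

In this sketch, each row contributes one `accept` call. Once all rows have been processed, `totalMegabytes()` and `totalSeconds()` give the values that would populate the two target columns.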

✅ Testing

Unit tests for:
- ByteSize parsing (e.g., "2KB", "1.5MB")
- TimeDuration parsing (e.g., "100ms", "2.5h", "3min")
Grammar parser tests ensure correct and invalid usage of the new syntax
Directive tests verify:
- Aggregation logic correctness
- Accurate conversions and results with floating-point tolerance

Usage (Added to README)

aggregate-stats :transfer_size :duration total_size_mb total_time_sec

- Introduced BYTE_SIZE and TIME_DURATION tokens in the ANTLR grammar.
- Implemented ByteSize and TimeDuration classes for parsing and conversion.
- Updated RecipeVisitor to handle new token types in directives.
- Created AggregateStats directive for aggregating byte sizes and time durations.
- Added unit tests for ByteSize and TimeDuration parsing.
- Implemented integration tests for AggregateStats directive.
- Updated pom.xml files to include necessary dependencies and plugins.

google-cla · 2025-04-16T15:46:17Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Chayandas07 and others added 5 commits April 16, 2025 20:23

fix: Remove unnecessary lines in prompts.text for clarity

fcb2ec4

test: add unit tests for ByteSize and TimeDuration parsing logic

48df196

done

6ce2875

done

6b9d7aa
