Fixed ByteSize & TimeDuration Tokens and Aggregation Directive #992

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · wants to merge 18 commits into develop
132 changes: 132 additions & 0 deletions prompts.txt
@@ -0,0 +1,132 @@
# AI Tooling Prompts Used

Prompt 1: Initial Implementation
I am modifying an ANTLR grammar for a Java project. Help me write lexer and parser rules in the Directives.g4 file for two new tokens:

BYTE_SIZE → matches values like "10KB", "2.5MB", "1GB".

TIME_DURATION → matches values like "5ms", "3.2s", "1min".

Also include helpful fragments such as BYTE_UNIT and TIME_UNIT. Finally, show how to update the value parser rule (or create byteSizeArg and timeDurationArg rules if needed) so the new tokens are accepted as directive arguments.
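
A sketch of what such rules might look like (the rule and fragment names here are assumptions, not the actual contents of Directives.g4, and the interaction with any existing number token in the grammar would need checking):

```antlr
// Hypothetical lexer rules; placing BYTE_SIZE and TIME_DURATION before any
// plain number token lets the longer, unit-suffixed match win.
BYTE_SIZE     : DIGITS BYTE_UNIT ;
TIME_DURATION : DIGITS TIME_UNIT ;

fragment DIGITS    : [0-9]+ ('.' [0-9]+)? ;
fragment BYTE_UNIT : 'B' | 'KB' | 'MB' | 'GB' | 'TB' ;
fragment TIME_UNIT : 'ns' | 'us' | 'ms' | 's' | 'min' | 'h' | 'd' ;

// Parser rules so the tokens are usable as directive arguments:
byteSizeArg     : BYTE_SIZE ;
timeDurationArg : TIME_DURATION ;
```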



Prompt 2: Create ByteSize and TimeDuration Token Classes
Help me build two new token classes:

ByteSize.java and TimeDuration.java

Each class should:

1. Implement the io.cdap.wrangler.api.parser.Token interface

2. Parse strings like "10KB", "2.5MB", "1GB" (for ByteSize) and "500ms", "1.2s", "3min" (for TimeDuration)

3. Store the value in canonical units (bytes for ByteSize, milliseconds or nanoseconds for TimeDuration)

4. Provide getter methods such as getBytes() and getMilliseconds()


Prompt 3: Update Token Types and Directive Argument Support
Help me extend a token parsing framework in Java for a data transformation tool. I need to:

Add two new token types: BYTE_SIZE and TIME_DURATION in the token registry or enum used (if any).

Update the logic that defines valid argument types in directives,
so that BYTE_SIZE and TIME_DURATION can be accepted where appropriate.

Mention any necessary updates in registration/configuration files or classes if applicable.



Prompt 4: Add Visitor Methods for New Parser Rules

In my ANTLR-based Java parser for a directive language, I want to add two new parser rules: byteSizeArg and timeDurationArg. Help me:

1. Implement visitor methods visitByteSizeArg and visitTimeDurationArg in the appropriate visitor or parser class.

2. Have these methods return ByteSize and TimeDuration token instances respectively, built from ctx.getText().

Ensure these token instances are added to the TokenGroup for the directive being parsed.



Prompt 5: Implement New AggregateStats Directive

I’m creating a new directive class called AggregateStats in a Java-based data transformation engine. Help me implement the Directive interface so that it can:

1. Accept at least 4 arguments:

Source column (byte sizes)

Source column (time durations)

Target column for total size

Target column for total/average time

2. Optionally accept:

Aggregation type (total, avg)

Output unit (MB, GB, seconds, minutes)

3. In initialize, store the argument values.

4. In execute, use ExecutorContext.getStore() to accumulate byte size and time duration values (converted to canonical units).

5. In finalize, return a single Row with the converted results (e.g., MB, seconds).
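
The accumulation and conversion steps above can be sketched in isolation. AggregateSketch and its method names are hypothetical; the real directive would read and write these running totals through ExecutorContext.getStore() rather than plain fields.

```java
// Minimal sketch of the AggregateStats accumulation logic, under the
// assumption that values arrive already converted to canonical units
// (bytes and milliseconds).
class AggregateSketch {
  private long totalBytes = 0;
  private long totalMillis = 0;

  // Accumulate one row's byte-size and time-duration values.
  void add(long bytes, long millis) {
    totalBytes += bytes;
    totalMillis += millis;
  }

  // Convert the accumulated byte total to megabytes for the output row.
  double totalMegabytes() {
    return totalBytes / (1024.0 * 1024.0);
  }

  // Convert the accumulated time total to seconds for the output row.
  double totalSeconds() {
    return totalMillis / 1000.0;
  }
}
```

For the sample values used later ("1MB" + "500KB" and "2s" + "500ms"), this yields roughly 1.488 MB and 2.5 s.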


Prompt 6: Write Unit Tests for ByteSize and TimeDuration

I need to write JUnit tests for two Java classes: ByteSize and TimeDuration.

1. These classes parse strings like "10KB" and "500ms" respectively.

2. Test valid cases: "10KB", "1.5MB", "1GB" for ByteSize and "500ms", "2s", "1min" for TimeDuration.

3. Verify that getBytes() and getMilliseconds() return the correct canonical values.

Include a few invalid input tests and assert that they throw proper exceptions.




Prompt 7: Write Parser Tests for New Grammar

I’ve added BYTE_SIZE and TIME_DURATION tokens to an ANTLR grammar. Help me write parser tests in Java to:

Validate that inputs like "10KB", "1.5MB", "5ms", "3min" are accepted in directive recipes.

Use test classes like GrammarBasedParserTest.java or RecipeCompilerTest.java.

Also test invalid values (e.g., "10KBB", "1..5MB", "ms5") and ensure they are rejected.




Prompt 8: Write Integration Test for AggregateStats Directive

I’ve created an AggregateStats directive that aggregates byte size and time duration columns. Help me write an integration test using TestingRig to:

Create input data: List<Row> with columns like data_transfer_size and response_time using values like "1MB", "500KB", "2s", "500ms".

Define a recipe like:

```java
String[] recipe = new String[] {
  "aggregate-stats :data_transfer_size :response_time total_size_mb total_time_sec"
};
```

Execute with TestingRig.execute(recipe, rows).

Assert that the resulting row contains the correct aggregated values (in MB and seconds).

Use a delta tolerance (e.g., 0.001) when comparing floating-point values.
ByteSize.java (new file)
@@ -0,0 +1,77 @@
/*
* Copyright © 2024 Cask Data, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/

package io.cdap.wrangler.api.parser;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import io.cdap.wrangler.api.annotations.PublicEvolving;

/**
* Represents a byte size value with units (KB, MB, GB, TB).
*/
@PublicEvolving
public class ByteSize implements Token {
  private final String value;
  private final long bytes;

  public ByteSize(String value) {
    this.value = value;
    this.bytes = parseBytes(value);
  }

  private static long parseBytes(String value) {
    String number = value.replaceAll("[^0-9.]", "");
    String unit = value.replaceAll("[0-9.]", "").toLowerCase();
    double size = Double.parseDouble(number);

    switch (unit) {
      case "kb":
        return (long) (size * 1024);
      case "mb":
        return (long) (size * 1024 * 1024);
      case "gb":
        return (long) (size * 1024 * 1024 * 1024);
      case "tb":
        return (long) (size * 1024L * 1024L * 1024L * 1024L);
      default:
        return (long) size; // Base unit bytes
    }
  }

  @Override
  public String value() {
    return value;
  }

  public long getBytes() {
    return bytes;
  }

  @Override
  public TokenType type() {
    return TokenType.BYTE_SIZE;
  }

  @Override
  public JsonElement toJson() {
    JsonObject object = new JsonObject();
    object.addProperty("type", TokenType.BYTE_SIZE.name());
    object.addProperty("value", value);
    object.addProperty("bytes", bytes);
    return object;
  }
}
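
As a standalone illustration of the conversion this class performs, the parsing logic can be exercised outside the Wrangler API. ByteSizeDemo and parse are names invented here; the logic mirrors parseBytes above.

```java
// Standalone copy of the ByteSize conversion logic, so it runs without the
// io.cdap.wrangler.api.parser.Token interface on the classpath.
class ByteSizeDemo {
  static long parse(String value) {
    String number = value.replaceAll("[^0-9.]", "");
    String unit = value.replaceAll("[0-9.]", "").toLowerCase();
    double size = Double.parseDouble(number);
    switch (unit) {
      case "kb": return (long) (size * 1024);
      case "mb": return (long) (size * 1024 * 1024);
      case "gb": return (long) (size * 1024 * 1024 * 1024);
      case "tb": return (long) (size * 1024L * 1024L * 1024L * 1024L);
      default:   return (long) size; // "B" (or no unit) falls through as bytes
    }
  }
}
```

For example, "10KB" parses to 10240 bytes and "2.5MB" to 2621440 bytes.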
TimeDuration.java (new file)
@@ -0,0 +1,85 @@
/*
* Copyright © 2024 Cask Data, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/

package io.cdap.wrangler.api.parser;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import io.cdap.wrangler.api.annotations.PublicEvolving;

/**
 * Represents a time duration value with units (ns, us, ms, s, m/min, h, d).
*/
@PublicEvolving
public class TimeDuration implements Token {
  private final String value;
  private final long milliseconds;

  public TimeDuration(String value) {
    this.value = value;
    this.milliseconds = parseMilliseconds(value);
  }

  private static long parseMilliseconds(String value) {
    String number = value.replaceAll("[^0-9.]", "");
    String unit = value.replaceAll("[0-9.]", "").toLowerCase();
    double duration = Double.parseDouble(number);

    switch (unit) {
      case "ms":
        return (long) duration;
      case "s":
        return (long) (duration * 1000);
      case "m":
      case "min":
        return (long) (duration * 60 * 1000);
      case "h":
        return (long) (duration * 60 * 60 * 1000);
      case "d":
        return (long) (duration * 24 * 60 * 60 * 1000);
      case "us":
        return (long) (duration / 1000.0); // Convert microseconds to milliseconds
      case "ns":
        return (long) (duration / 1000000.0); // Convert nanoseconds to milliseconds
      default:
        return (long) duration; // Default case
    }
  }

  @Override
  public String value() {
    return value;
  }

  public long getMilliseconds() {
    return milliseconds;
  }

  @Override
  public TokenType type() {
    return TokenType.TIME_DURATION;
  }

  @Override
  public JsonElement toJson() {
    JsonObject object = new JsonObject();
    object.addProperty("type", TokenType.TIME_DURATION.name());
    object.addProperty("value", value);
    object.addProperty("milliseconds", milliseconds);
    return object;
  }
}
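
As with ByteSize, the conversion can be checked in isolation. TimeDurationDemo and parse are names invented here; the logic mirrors parseMilliseconds above.

```java
// Standalone copy of the TimeDuration conversion logic, runnable without the
// Wrangler API on the classpath.
class TimeDurationDemo {
  static long parse(String value) {
    String number = value.replaceAll("[^0-9.]", "");
    String unit = value.replaceAll("[0-9.]", "").toLowerCase();
    double duration = Double.parseDouble(number);
    switch (unit) {
      case "ms":  return (long) duration;
      case "s":   return (long) (duration * 1000);
      case "m":
      case "min": return (long) (duration * 60 * 1000);
      case "h":   return (long) (duration * 60 * 60 * 1000);
      case "d":   return (long) (duration * 24 * 60 * 60 * 1000);
      case "us":  return (long) (duration / 1000.0);     // microseconds -> ms
      case "ns":  return (long) (duration / 1000000.0);  // nanoseconds -> ms
      default:    return (long) duration;                // assume milliseconds
    }
  }
}
```

For example, "3.2s" parses to 3200 ms and "1min" to 60000 ms.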
TokenType.java
@@ -152,5 +152,11 @@ public enum TokenType implements Serializable {
* Represents the enumerated type for the object of type {@code String} with restrictions
* on characters that can be present in a string.
*/
  IDENTIFIER,

  /**
   * Represents the enumerated type for a byte size value with a unit, e.g. {@code 10KB}.
   */
  BYTE_SIZE,

  /**
   * Represents the enumerated type for a time duration value with a unit, e.g. {@code 500ms}.
   */
  TIME_DURATION;
}