Fixed ByteSize & TimeDuration Tokens and Aggregation Directive #992

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · wants to merge 18 commits into develop
132 changes: 132 additions & 0 deletions prompts.txt
@@ -0,0 +1,132 @@
# AI Tooling Prompts Used

Prompt 1: Initial Implementation
I am modifying an ANTLR grammar for a Java project. Help me write lexer and parser rules in the Directives.g4 file for two new tokens:

BYTE_SIZE → matches values like "10KB", "2.5MB", "1GB".

TIME_DURATION → matches values like "5ms", "3.2s", "1min".

Also include helpful fragments such as BYTE_UNIT and TIME_UNIT. Finally, show how to update the value parser rule (or create byteSizeArg and timeDurationArg rules if needed) so the new tokens are accepted as directive arguments.
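
A sketch of what such rules might look like (the rule and fragment names here are assumptions, not the actual contents of Directives.g4, and the interaction with any existing number token in the grammar would need checking):

```antlr
// Hypothetical lexer rules; placing BYTE_SIZE and TIME_DURATION before any
// plain number token lets the longer, unit-suffixed match win.
BYTE_SIZE     : DIGITS BYTE_UNIT ;
TIME_DURATION : DIGITS TIME_UNIT ;

fragment DIGITS    : [0-9]+ ('.' [0-9]+)? ;
fragment BYTE_UNIT : 'B' | 'KB' | 'MB' | 'GB' | 'TB' ;
fragment TIME_UNIT : 'ns' | 'us' | 'ms' | 's' | 'min' | 'h' | 'd' ;

// Parser rules so the tokens are usable as directive arguments:
byteSizeArg     : BYTE_SIZE ;
timeDurationArg : TIME_DURATION ;
```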



Prompt 2: Create ByteSize and TimeDuration Token Classes
Help me build two new token classes:

ByteSize.java and TimeDuration.java

Each class should:

1. Implement the io.cdap.wrangler.api.parser.Token interface

2. Parse strings like "10KB", "2.5MB", "1GB" (for ByteSize) and "500ms", "1.2s", "3min" (for TimeDuration)

3. Store the value in canonical units (bytes for ByteSize, milliseconds or nanoseconds for TimeDuration)

4. Provide getter methods such as getBytes() and getMilliseconds()


Prompt 3: Update Token Types and Directive Argument Support
Help me extend a token parsing framework in Java for a data transformation tool. I need to:

Add two new token types: BYTE_SIZE and TIME_DURATION in the token registry or enum used (if any).

Update the logic that defines valid argument types in directives,
so that BYTE_SIZE and TIME_DURATION can be accepted where appropriate.

Mention any necessary updates in registration/configuration files or classes if applicable.



Prompt 4: Add Visitor Methods for New Parser Rules

In my ANTLR-based Java parser for a directive language, I want to add two new parser rules: byteSizeArg and timeDurationArg. Help me:

1. Implement visitor methods visitByteSizeArg and visitTimeDurationArg in the appropriate visitor or parser class.

2. Have these methods return ByteSize and TimeDuration token instances respectively, built from ctx.getText().

Ensure these token instances are added to the TokenGroup for the directive being parsed.



Prompt 5: Implement New AggregateStats Directive

I’m creating a new directive class called AggregateStats in a Java-based data transformation engine. Help me implement the Directive interface so that it can:

1. Accept at least 4 arguments:

Source column (byte sizes)

Source column (time durations)

Target column for total size

Target column for total/average time

2. Optionally accept:

Aggregation type (total, avg)

Output unit (MB, GB, seconds, minutes)

3. In initialize, store the argument values.

4. In execute, use ExecutorContext.getStore() to accumulate byte size and time duration values (converted to canonical units).

5. In finalize, return a single Row with the converted results (e.g., MB, seconds).
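
The accumulation and conversion steps above can be sketched in isolation. AggregateSketch and its method names are hypothetical; the real directive would read and write these running totals through ExecutorContext.getStore() rather than plain fields.

```java
// Minimal sketch of the AggregateStats accumulation logic, under the
// assumption that values arrive already converted to canonical units
// (bytes and milliseconds).
class AggregateSketch {
  private long totalBytes = 0;
  private long totalMillis = 0;

  // Accumulate one row's byte-size and time-duration values.
  void add(long bytes, long millis) {
    totalBytes += bytes;
    totalMillis += millis;
  }

  // Convert the accumulated byte total to megabytes for the output row.
  double totalMegabytes() {
    return totalBytes / (1024.0 * 1024.0);
  }

  // Convert the accumulated time total to seconds for the output row.
  double totalSeconds() {
    return totalMillis / 1000.0;
  }
}
```

For the sample values used later ("1MB" + "500KB" and "2s" + "500ms"), this yields roughly 1.488 MB and 2.5 s.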


Prompt 6: Write Unit Tests for ByteSize and TimeDuration

I need to write JUnit tests for two Java classes: ByteSize and TimeDuration.

1. These classes parse strings like "10KB" and "500ms" respectively.

2. Test valid cases: "10KB", "1.5MB", "1GB" for ByteSize and "500ms", "2s", "1min" for TimeDuration.

3. Verify that getBytes() and getMilliseconds() return the correct canonical values.

Include a few invalid input tests and assert that they throw proper exceptions.




Prompt 7: Write Parser Tests for New Grammar

I’ve added BYTE_SIZE and TIME_DURATION tokens to an ANTLR grammar. Help me write parser tests in Java to:

Validate that inputs like "10KB", "1.5MB", "5ms", "3min" are accepted in directive recipes.

Use test classes like GrammarBasedParserTest.java or RecipeCompilerTest.java.

Also test invalid values (e.g., "10KBB", "1..5MB", "ms5") and ensure they are rejected.




Prompt 8: Write Integration Test for AggregateStats Directive

I’ve created an AggregateStats directive that aggregates byte size and time duration columns. Help me write an integration test using TestingRig to:

Create input data: List<Row> with columns like data_transfer_size and response_time using values like "1MB", "500KB", "2s", "500ms".

Define a recipe like:

```java
String[] recipe = new String[] {
  "aggregate-stats :data_transfer_size :response_time total_size_mb total_time_sec"
};
```

Execute with TestingRig.execute(recipe, rows).

Assert that the resulting row contains the correct aggregated values (in MB and seconds).

Use a delta tolerance (e.g., 0.001) when comparing floating-point values.
ByteSize.java (new file)
@@ -0,0 +1,77 @@
/*
* Copyright © 2024 Cask Data, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/

package io.cdap.wrangler.api.parser;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import io.cdap.wrangler.api.annotations.PublicEvolving;

/**
* Represents a byte size value with units (KB, MB, GB, TB).
*/
@PublicEvolving
public class ByteSize implements Token {
  private final String value;
  private final long bytes;

  public ByteSize(String value) {
    this.value = value;
    this.bytes = parseBytes(value);
  }

  private static long parseBytes(String value) {
    String number = value.replaceAll("[^0-9.]", "");
    String unit = value.replaceAll("[0-9.]", "").toLowerCase();
    double size = Double.parseDouble(number);

    switch (unit) {
      case "kb":
        return (long) (size * 1024);
      case "mb":
        return (long) (size * 1024 * 1024);
      case "gb":
        return (long) (size * 1024 * 1024 * 1024);
      case "tb":
        return (long) (size * 1024L * 1024L * 1024L * 1024L);
      default:
        return (long) size; // Base unit bytes
    }
  }

  @Override
  public String value() {
    return value;
  }

  public long getBytes() {
    return bytes;
  }

  @Override
  public TokenType type() {
    return TokenType.BYTE_SIZE;
  }

  @Override
  public JsonElement toJson() {
    JsonObject object = new JsonObject();
    object.addProperty("type", TokenType.BYTE_SIZE.name());
    object.addProperty("value", value);
    object.addProperty("bytes", bytes);
    return object;
  }
}
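
As a standalone illustration of the conversion this class performs, the parsing logic can be exercised outside the Wrangler API. ByteSizeDemo and parse are names invented here; the logic mirrors parseBytes above.

```java
// Standalone copy of the ByteSize conversion logic, so it runs without the
// io.cdap.wrangler.api.parser.Token interface on the classpath.
class ByteSizeDemo {
  static long parse(String value) {
    String number = value.replaceAll("[^0-9.]", "");
    String unit = value.replaceAll("[0-9.]", "").toLowerCase();
    double size = Double.parseDouble(number);
    switch (unit) {
      case "kb": return (long) (size * 1024);
      case "mb": return (long) (size * 1024 * 1024);
      case "gb": return (long) (size * 1024 * 1024 * 1024);
      case "tb": return (long) (size * 1024L * 1024L * 1024L * 1024L);
      default:   return (long) size; // "B" (or no unit) falls through as bytes
    }
  }
}
```

For example, "10KB" parses to 10240 bytes and "2.5MB" to 2621440 bytes.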
TimeDuration.java (new file)
@@ -0,0 +1,85 @@
/*
* Copyright © 2024 Cask Data, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/

package io.cdap.wrangler.api.parser;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import io.cdap.wrangler.api.annotations.PublicEvolving;

/**
 * Represents a time duration value with units (ns, us, ms, s, m/min, h, d).
*/
@PublicEvolving
public class TimeDuration implements Token {
  private final String value;
  private final long milliseconds;

  public TimeDuration(String value) {
    this.value = value;
    this.milliseconds = parseMilliseconds(value);
  }

  private static long parseMilliseconds(String value) {
    String number = value.replaceAll("[^0-9.]", "");
    String unit = value.replaceAll("[0-9.]", "").toLowerCase();
    double duration = Double.parseDouble(number);

    switch (unit) {
      case "ms":
        return (long) duration;
      case "s":
        return (long) (duration * 1000);
      case "m":
      case "min":
        return (long) (duration * 60 * 1000);
      case "h":
        return (long) (duration * 60 * 60 * 1000);
      case "d":
        return (long) (duration * 24 * 60 * 60 * 1000);
      case "us":
        return (long) (duration / 1000.0); // Convert microseconds to milliseconds
      case "ns":
        return (long) (duration / 1000000.0); // Convert nanoseconds to milliseconds
      default:
        return (long) duration; // Default case
    }
  }

  @Override
  public String value() {
    return value;
  }

  public long getMilliseconds() {
    return milliseconds;
  }

  @Override
  public TokenType type() {
    return TokenType.TIME_DURATION;
  }

  @Override
  public JsonElement toJson() {
    JsonObject object = new JsonObject();
    object.addProperty("type", TokenType.TIME_DURATION.name());
    object.addProperty("value", value);
    object.addProperty("milliseconds", milliseconds);
    return object;
  }
}
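
As with ByteSize, the conversion can be checked in isolation. TimeDurationDemo and parse are names invented here; the logic mirrors parseMilliseconds above.

```java
// Standalone copy of the TimeDuration conversion logic, runnable without the
// Wrangler API on the classpath.
class TimeDurationDemo {
  static long parse(String value) {
    String number = value.replaceAll("[^0-9.]", "");
    String unit = value.replaceAll("[0-9.]", "").toLowerCase();
    double duration = Double.parseDouble(number);
    switch (unit) {
      case "ms":  return (long) duration;
      case "s":   return (long) (duration * 1000);
      case "m":
      case "min": return (long) (duration * 60 * 1000);
      case "h":   return (long) (duration * 60 * 60 * 1000);
      case "d":   return (long) (duration * 24 * 60 * 60 * 1000);
      case "us":  return (long) (duration / 1000.0);     // microseconds -> ms
      case "ns":  return (long) (duration / 1000000.0);  // nanoseconds -> ms
      default:    return (long) duration;                // assume milliseconds
    }
  }
}
```

For example, "3.2s" parses to 3200 ms and "1min" to 60000 ms.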
TokenType.java
@@ -152,5 +152,11 @@ public enum TokenType implements Serializable {
* Represents the enumerated type for the object of type {@code String} with restrictions
* on characters that can be present in a string.
*/
  IDENTIFIER,

  /**
   * Represents the enumerated type for a byte size value with a unit, e.g. {@code 10KB}.
   */
  BYTE_SIZE,

  /**
   * Represents the enumerated type for a time duration value with a unit, e.g. {@code 500ms}.
   */
  TIME_DURATION;
}